mpi4py: Regression in alltoall with intercommunicators #12367
The specific error is:
|
What do you mean? That's exactly what I reported in the description, right below the link to the full logs. Am I missing something? |
Errr... no, sorry. You didn't report the version of Open MPI, and I had to dig into the CI link to find it; I labeled the issue and then copy-and-pasted the specific error out of the logs mostly on auto-pilot (my eyes skipped right over the long line in your description and I missed it). |
@dalcinl could you remind me which arg to main.py controls the set of tests which are run? |
Okay, I figured that out. I'm not able to reproduce this problem running standalone. |
This error is only possible when out-of-sequence messages are left in a process queue when the communicator is freed or when MPI_Finalize is called. This means there are unmatched, out-of-sequence messages, which kind of suggests an incorrect MPI application. |
This failing test is doing a series of short-message Python (pickle-based) alltoall calls over an intercommunicator:
comm = make_intercomm(MPI.COMM_WORLD)
size = comm.Get_remote_size()
for smess in messages:
rmess = comm.alltoall(None)
assert rmess == [None]*size
    rmess = comm.alltoall([smess]*size)
assert rmess == [smess]*size
rmess = comm.alltoall(iter([smess]*size))
assert rmess == [smess]*size
comm.Free()
The algorithm has not changed in 15+ years. This test has been working with MPICH/Intel MPI/Microsoft MPI on Linux/macOS/Windows. It never gave a problem with Open MPI v4.1.x, v5.0.x, and, until recently, the main branch. Moreover, the issue is not easy to reproduce, and the test even passes most times in GitHub Actions. While I understand it is tempting to assume that the problem is on the application (that is, mpi4py) side, IMHO all the evidence points to the backend. I'm indeed freeing the intercommunicator (with comm.Free()) before finalization. |
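For readers who want to exercise the same communication shape outside the mpi4py test harness, here is a minimal, self-contained C sketch: split MPI_COMM_WORLD into two halves, build an intercommunicator, run alltoall over it repeatedly, and free it. This is an illustration only (the loop count, payload type, and all names are arbitrary), not the mpi4py test itself.
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int rank, size, rsize, color, i, iter;
    MPI_Comm local, inter;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* requires at least 2 ranks: lower half vs upper half */
    color = (rank < size/2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &local);
    /* local leader is rank 0 of each half; remote leader is the first
       rank of the other half in MPI_COMM_WORLD */
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, color ? 0 : size/2, 0, &inter);
    MPI_Comm_remote_size(inter, &rsize);
    int *sbuf = malloc(rsize * sizeof(int));
    int *rbuf = malloc(rsize * sizeof(int));
    for (iter = 0; iter < 10; iter++) {
        for (i = 0; i < rsize; i++) sbuf[i] = rank;
        /* on an intercommunicator, each rank exchanges one block with
           every rank of the remote group */
        MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, inter);
    }
    free(sbuf);
    free(rbuf);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}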
@dalcinl is there a command to run only the failing tests? Also, what if you |
The main entry point to run everything is
PS: Run
I'll try locally, but given that I cannot reproduce that way, I'm not sure such an attempt will really be relevant. |
OK, now I can reproduce. I have a machine with 128 virtual cores. If I run with 127 mpi4py processes it seems to work most of the time:
but as soon as I ask for 128 MPI processes, then the assertion triggers:
The
|
my bad, the correct syntax is
|
I figured it out, eventually... it does not fix the issue. |
I confirm I was able to reproduce the issue. The inline reproducer below can be used to evidence it:
#include <mpi.h>
#include <stdio.h>
MPI_Comm intercomm;
int main(int argc, char *argv[]) {
int rank, size, lrank, lsize, rsize;
MPI_Comm local;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_split(MPI_COMM_WORLD, (rank<size/2)?0:1, 0, &local);
MPI_Comm_rank(local, &lrank);
MPI_Comm_size(local, &lsize);
printf("I am %d: %d / %d\n", rank, lrank, lsize);
MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank<size/2)?size/2:0, 0, &intercomm);
MPI_Comm_remote_size(intercomm, &rsize);
size_t sbuf[size], rbuf[size];
MPI_Request reqs[size*2];
MPI_Request *req = reqs;
int i;
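/* post one nonblocking receive from, and one send to, every rank of the remote group */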
for (i=0; i<rsize; i++, req++) {
MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, intercomm, req);
}
for (i=0; i<rsize; i++, req++) {
MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, intercomm, req);
}
MPI_Waitall(rsize*2, reqs, MPI_STATUSES_IGNORE);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Comm_free(&intercomm);
MPI_Comm_free(&local);
MPI_Finalize();
return 0;
} |
I still can't reproduce with either the above C code or the mpi4py test at 128 ranks.
|
Is this main branch only, or also affecting 5.0? |
I have not checked the 5.0.x branch, @wenduwan. UCX seems to be complaining that a receive that had been posted was never matched when the app called MPI_Finalize. |
I'm not sure what @ggouaillardet's code snippet is supposed to be demonstrating. It's only doing a bunch of isends. |
The program crashes in my environment (most of the time) with 13 tasks. My box only has 4 cores, so maybe oversubscription helps evidence the issue. |
Your example is just doing isends. Where are the matching receives? |
@wenduwan I had not seen the issue with v5.0.x on GHA so far. I tried locally with 128 and 256 MPI processes many times, and the test runs to completion without any issue. @hppritcha I'm using the following configure options; note the --enable-debug and --enable-mem-debug flags:
options=(
--prefix=/home/devel/mpi/openmpi/$branch
--without-ofi
--without-ucx
--without-psm2
--without-cuda
--without-rocm
--with-pmix=internal
--with-prrte=internal
--with-libevent=internal
--with-hwloc=internal
--enable-debug
--enable-mem-debug
) |
Yep, I'm using --enable-debug. I've not used --enable-mem-debug but I'll add that. |
BTW, just to reinforce my point that I don't believe this issue is mpi4py's fault, I ran 128 and 256 processes with v4.1.x (same configure options as above), and all seems to be good. |
Oops. The first call should be MPI_Irecv(); I will double-check that. This was supposed to be an MPI_Alltoall()-style exchange that can also be used to evidence the issue. |
I confirm, ob1 seems to be broken (independent of the BTL used). I altered @ggouaillardet's code above such that on deadlock one can do |
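The exact alteration is not shown above. A common way to get this kind of on-deadlock inspection (a hedged sketch of the general technique, not the actual change) is to have each rank print its PID and host and then spin on a volatile flag until a debugger attaches, e.g. with gdb -p <pid> followed by "set var hold = 0":
#include <stdio.h>
#include <unistd.h>
void wait_for_debugger(void)
{
    volatile int hold = 1;
    char host[256];
    gethostname(host, sizeof(host));
    printf("pid %d on %s waiting for debugger\n", (int)getpid(), host);
    fflush(stdout);
    while (hold)
        sleep(1);   /* from the attached debugger: set var hold = 0 */
}
Calling this early in main() makes it easy to attach to each rank before the exchange starts; for a run that is already hung, attaching gdb directly to the stuck PIDs works as well.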
In case it is useful, the first time I noticed the error was about 2 weeks ago while running commit cb00772. |
I don't think that particular commit can have any impact (none of our machines would be using the |
Definitely. My whole point was to speed up the bisection a bit. The faulty commit is hopefully not too much older than the one I reported above. |
FWIW, it seems some fixes were applied later, but that did not fix the issue we are discussing now. |
Here is the reproducer I am now using. I noted that if I uncomment the MPI_Barrier(MPI_COMM_WORLD) line, the problem goes away:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
int rank, size, lrank, lsize, rsize;
MPI_Comm local, inter;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_split(MPI_COMM_WORLD, (rank<size/2)?0:1, 0, &local);
MPI_Comm_rank(local, &lrank);
MPI_Comm_size(local, &lsize);
printf("I am %d: %d / %d\n", rank, lrank, lsize);
MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank<size/2)?size/2:0, 0, &inter);
// MPI_Barrier(MPI_COMM_WORLD);
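    // <-- uncommenting this barrier makes the problem go away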
MPI_Comm_remote_size(inter, &rsize);
size_t sbuf[size], rbuf[size];
MPI_Request reqs[size*2];
MPI_Request *req = reqs;
int i;
for (i=0; i<rsize; i++, req++) {
MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, inter, req);
}
for (i=0; i<rsize; i++, req++) {
MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, inter, req);
}
MPI_Waitall(rsize*2, reqs, MPI_STATUSES_IGNORE);
MPI_Comm_free(&inter);
MPI_Comm_free(&local);
MPI_Finalize();
return 0;
} |
The difference between now and before is that there is no more allreduce on intercommunicators.
diff --git a/ompi/communicator/comm_cid.c b/ompi/communicator/comm_cid.c
index be022b8..fa427c6 100644
--- a/ompi/communicator/comm_cid.c
+++ b/ompi/communicator/comm_cid.c
@@ -917,14 +917,14 @@ int ompi_comm_activate_nb (ompi_communicator_t **newcomm, ompi_communicator_t *c
*/
ret = context->iallreduce_fn (&context->local_peers, &context->max_local_peers, 1, MPI_MAX, context,
&subreq);
- if (OMPI_SUCCESS != ret) {
- ompi_comm_request_return (request);
- return ret;
- }
- ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
} else {
- ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, NULL, 0);
+ ret = context->iallreduce_fn (NULL, NULL, 0, MPI_MAX, context, &subreq);
+ }
+ if (OMPI_SUCCESS != ret) {
+ ompi_comm_request_return (request);
+ return ret;
}
+ ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
ompi_comm_request_start (request);
diff --git a/ompi/mca/coll/libnbc/nbc_ireduce.c b/ompi/mca/coll/libnbc/nbc_ireduce.c
index 1f7464e..3052f2a 100644
--- a/ompi/mca/coll/libnbc/nbc_ireduce.c
+++ b/ompi/mca/coll/libnbc/nbc_ireduce.c
@@ -67,7 +67,7 @@ static int nbc_reduce_init(const void* sendbuf, void* recvbuf, int count, MPI_Da
MPI_Aint ext;
NBC_Schedule *schedule;
char *redbuf=NULL, inplace;
- void *tmpbuf;
+ void *tmpbuf = NULL;
char tmpredbuf = 0;
enum { NBC_RED_BINOMIAL, NBC_RED_CHAIN, NBC_RED_REDSCAT_GATHER} alg;
ompi_coll_libnbc_module_t *libnbc_module = (ompi_coll_libnbc_module_t*) module;
@@ -126,25 +126,27 @@ static int nbc_reduce_init(const void* sendbuf, void* recvbuf, int count, MPI_Da
}
/* allocate temporary buffers */
- if (alg == NBC_RED_REDSCAT_GATHER || alg == NBC_RED_BINOMIAL) {
- if (rank == root) {
- /* root reduces in receive buffer */
- tmpbuf = malloc(span);
- redbuf = recvbuf;
+ if (0 < span) {
+ if (alg == NBC_RED_REDSCAT_GATHER || alg == NBC_RED_BINOMIAL) {
+ if (rank == root) {
+ /* root reduces in receive buffer */
+ tmpbuf = malloc(span);
+ redbuf = recvbuf;
+ } else {
+ /* recvbuf may not be valid on non-root nodes */
+ ptrdiff_t span_align = OPAL_ALIGN(span, datatype->super.align, ptrdiff_t);
+ tmpbuf = malloc(span_align + span);
+ redbuf = (char *)span_align - gap;
+ tmpredbuf = 1;
+ }
} else {
- /* recvbuf may not be valid on non-root nodes */
- ptrdiff_t span_align = OPAL_ALIGN(span, datatype->super.align, ptrdiff_t);
- tmpbuf = malloc(span_align + span);
- redbuf = (char *)span_align - gap;
- tmpredbuf = 1;
+ tmpbuf = malloc (span);
+ segsize = 16384/2;
}
- } else {
- tmpbuf = malloc (span);
- segsize = 16384/2;
- }
- if (OPAL_UNLIKELY(NULL == tmpbuf)) {
- return OMPI_ERR_OUT_OF_RESOURCE;
+ if (OPAL_UNLIKELY(0 < span && NULL == tmpbuf)) {
+ return OMPI_ERR_OUT_OF_RESOURCE;
+ }
}
#ifdef NBC_CACHE_SCHEDULE |
@ggouaillardet Oh boy... this is the damn comm_cid stuff that has been causing grief elsewhere. Maybe that's the root of the problem: without the barrier, the whole thing is broken for intercomms. BTW, maybe a better fix would be to actually use an ibarrier? |
Let's not get ahead of the solution here. The comm_cid selection algorithm is one of the most critical pieces of communicator management; it has hundreds of thousands of tests run on it every night and it performs exactly how it was designed. What the last reduce (which in the case of the intercomm would give an answer we would not be able to use anyway) provides is an extra synchronization preventing processes from sending messages too early. However, since this reduce is applied on the parent communicator, it does indeed behave as a barrier for the creation of the new intercomm, and as such it delays the arrival of messages until the local parts of the communicators are up and running (this is also confirmed by the fact that your error was related to the frags_cant_match list). We had protection against this, but for some reason it does not seem to work correctly. |
I was finally able to reproduce (sometimes) the mpi4py alltoall hang (although in a different place, test_cco_obj_inter) by setting up a VM with 4 cores and running oversubscribed by 1. |
If I revert commit 23df181 I'm not seeing a hang when running |
I am able to reproduce regularly with @ggouaillardet's test case and am looking into this. |
Okay, I figured out the problem here. There's either a design flaw in the way CIDs are being allocated, or else we really do need a barrier-like operation on the parent communicator used to create the new intercomm. What I determined is that here in the OB1 PML, a bogus communicator is being returned (because the parent comm was put in as a placeholder in the array of communicator handles). Here's the code snippet in |
The problem is the ompi_comm_lookup. The way the vector of communicator handles is being used in the cid alloc algorithm, the parent communicator is used as a "placeholder" in this array to help deal with the case of multiple threads trying to get a cid. BUT this communicator is not valid for use with the ctx in the hdr pkt. So rather than the pkt being routed on to the non_existing_communicator_pending list, it gets put on the wrong mca_pml_ob1_comm_proc_t struct. Now, in comm_cid.c in the 4.1.x branch, here's the original code we had that related to this: |
But now in main and 5.0.x it looks like this: |
It's unfortunate that the useful comment about the barrier was removed. It would have saved a week or so of debugging. |
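To make the failure mode described above concrete, here is a toy model in plain C. It is illustration only: the names and data structures are invented and far simpler than the real ob1/comm_cid code. It shows that if the table slot reserved for a not-yet-activated cid temporarily holds a pointer to the parent communicator, a lookup for an early-arriving fragment appears to succeed, and the fragment gets matched against the wrong communicator instead of being parked on a pending list.
#include <stdio.h>
#define MAX_COMMS 8
typedef struct toy_comm {
    int cid;
    const char *name;
} toy_comm;
static toy_comm *comm_table[MAX_COMMS];   /* index == cid */
static toy_comm *toy_lookup(int cid) {
    return (cid >= 0 && cid < MAX_COMMS) ? comm_table[cid] : NULL;
}
int main(void) {
    toy_comm parent = { 0, "parent" };
    comm_table[0] = &parent;
    /* cid allocation in progress: slot 1 is reserved for the new
       intercommunicator, but the parent pointer is used as the
       placeholder instead of a recognizable sentinel */
    comm_table[1] = &parent;
    /* an early fragment arrives carrying ctx/cid == 1 */
    toy_comm *c = toy_lookup(1);
    if (c == NULL)
        printf("cid 1 unknown: park the fragment on a pending list (correct)\n");
    else
        printf("cid 1 resolved to '%s': fragment matched against the wrong communicator (the bug)\n", c->name);
    return 0;
}
This is also why the fixes referenced below switch the placeholder to a dedicated sentinel (ompi_mpi_comm_null, later OMPI_COMM_SENTINEL) that the lookup can recognize.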
@hppritcha A minor correction - the barrier still exists on v5.0.x - it's only removed in main. |
Ah, good point @wenduwan. |
The existence of the code protecting against early arrival of messages proves that I am correct in my assessment that a barrier is, or at least was at some point, not necessary. Indeed, if the barrier were necessary there would be no need for that protection. Going back to your analysis that we put the parent communicator in the |
A barrier was removed from the cid allocation algorithm for intercommunicators, which leads to random hangs when doing MPI_Intercomm_create or its moral equivalent. See issue open-mpi#12367. The cid allocation algorithm and comm-from-cid lookup were modified to use a pointer to ompi_mpi_comm_null as the placeholder in the list of pointers to ompi communicators. A convenience debug function was added to comm_cid.c to dump out this table of communicators. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
A barrier was removed from the cid allocation algorithm for intercommunicators, which leads to random hangs when doing MPI_Intercomm_create or its moral equivalent. See issue open-mpi#12367. The cid allocation algorithm and comm-from-cid lookup were modified to use OMPI_COMM_SENTINEL as the placeholder in the list of pointers to ompi communicators. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Guys, may I gently request adding the barrier back until a definitive and more performant solution is found? Otherwise, this issue keeps giving trouble when testing unrelated developments like #12384. |
Fixed via #12465 |
Thanks, @hppritcha !!! |
A barrier was removed from the cid allocation algorithm for intercommunicators, which leads to random hangs when doing MPI_Intercomm_create or its moral equivalent. See issue open-mpi#12367. The cid allocation algorithm and comm-from-cid lookup were modified to use OMPI_COMM_SENTINEL as the placeholder in the list of pointers to ompi communicators. Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 15a3d26)
https://github.com/mpi4py/mpi4py-testing/actions/runs/8034284372/job/21945704643
BTW, I've submitted #12357 to re-enable multi-proc runs of the mpi4py tests. I've just rebased that PR, but now the tests may fail in the same way I'm reporting here.