mpi4py: Regression in alltoall with intercommunicators #12367
The specific error is:
|
What do you mean? That's exactly what I reported in the description, right below the link to the full logs. Am I missing something? |
Errr... no, sorry. You didn't report the version of Open MPI, and I had to dig into the CI link to find it; I labeled the issue and then copy-and-pasted the specific error out of the logs mostly on auto-pilot (my eyes skipped right over the long line in your description and I missed it). |
@dalcinl could you remind me which arg to main.py controls the set of tests which are run? |
Okay, I figured that out. I'm not able to reproduce this problem running standalone. |
This error is only possible when out-of-sequence messages are left in a process queue when the communicator is freed or when MPI_Finalize is called. This means there are unmatched, out-of-sequence messages, which kind of suggests an incorrect MPI application. |
This failing test is doing a series of short-message Python (pickle-based) alltoall calls over an intercommunicator:
comm = make_intercomm(MPI.COMM_WORLD)
size = comm.Get_remote_size()
for smess in messages:
rmess = comm.alltoall(None)
assert rmess == [None]*size
    rmess = comm.alltoall([smess]*size)
assert rmess == [smess]*size
rmess = comm.alltoall(iter([smess]*size))
assert rmess == [smess]*size
comm.Free()
The algorithm has not changed in 15+ years. This test has been working with MPICH/Intel MPI/Microsoft MPI on Linux/macOS/Windows. It never gave a problem with Open MPI v4.1.x, v5.0.x, and, until recently, the main branch. Moreover, the issue is not easy to reproduce, and the test even passes most times in GitHub Actions. While I understand it is tempting to assume that the problem is on the application (that is, mpi4py) side, IMHO all the evidence points to the backend. I'm indeed freeing the intercommunicator (with comm.Free()) before finalization. |
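For readers who want to exercise the same communication shape outside the mpi4py test harness, here is a minimal, self-contained C sketch: split MPI_COMM_WORLD into two halves, build an intercommunicator, run alltoall over it repeatedly, and free it. This is an illustration only (the loop count, payload type, and all names are arbitrary), not the mpi4py test itself.
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int rank, size, rsize, color, i, iter;
    MPI_Comm local, inter;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* requires at least 2 ranks: lower half vs upper half */
    color = (rank < size/2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &local);
    /* local leader is rank 0 of each half; remote leader is the first
       rank of the other half in MPI_COMM_WORLD */
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, color ? 0 : size/2, 0, &inter);
    MPI_Comm_remote_size(inter, &rsize);
    int *sbuf = malloc(rsize * sizeof(int));
    int *rbuf = malloc(rsize * sizeof(int));
    for (iter = 0; iter < 10; iter++) {
        for (i = 0; i < rsize; i++) sbuf[i] = rank;
        /* on an intercommunicator, each rank exchanges one block with
           every rank of the remote group */
        MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, inter);
    }
    free(sbuf);
    free(rbuf);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}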
@dalcinl is there a command to run only the failing tests? Also, what if you |
The main entry point to run everything is
PS: Run
I'll try locally, but given that I cannot reproduce that way, I'm not sure such an attempt will really be relevant. |
OK, now I can reproduce. I have a machine with 128 virtual cores. If I run with 127 mpi4py processes it seems to work most of the time:
but as soon as I ask for 128 MPI processes, then the assertion triggers:
The
|
my bad, the correct syntax is
|
I figured it out, eventually... it does not fix the issue. |
I confirm I was able to reproduce the issue. The inline reproducer below can be used to evidence it:
#include <mpi.h>
#include <stdio.h>
MPI_Comm intercomm;
int main(int argc, char *argv[]) {
int rank, size, lrank, lsize, rsize;
MPI_Comm local;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_split(MPI_COMM_WORLD, (rank<size/2)?0:1, 0, &local);
MPI_Comm_rank(local, &lrank);
MPI_Comm_size(local, &lsize);
printf("I am %d: %d / %d\n", rank, lrank, lsize);
MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank<size/2)?size/2:0, 0, &intercomm);
MPI_Comm_remote_size(intercomm, &rsize);
size_t sbuf[size], rbuf[size];
MPI_Request reqs[size*2];
MPI_Request *req = reqs;
int i;
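/* post one nonblocking receive from, and one send to, every rank of the remote group */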
for (i=0; i<rsize; i++, req++) {
MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, intercomm, req);
}
for (i=0; i<rsize; i++, req++) {
MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, intercomm, req);
}
MPI_Waitall(rsize*2, reqs, MPI_STATUSES_IGNORE);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Comm_free(&intercomm);
MPI_Comm_free(&local);
MPI_Finalize();
return 0;
} |
I still can't reproduce with either the above C code or the mpi4py test at 128 ranks.
|
Is this main branch only, or also affecting 5.0? |
I have not checked the 5.0.x branch, @wenduwan. UCX seems to be complaining that a receive that had been posted was never matched when the app called MPI_Finalize. |
I'm not sure what @ggouaillardet's code snippet is supposed to be demonstrating. It's only doing a bunch of isends. |
The program crashes in my environment (most of the time) with 13 tasks. My box only has 4 cores, so maybe oversubscription helps evidence the issue. |
Your example is just doing isends. Where are the matching receives? |
@wenduwan I had not seen the issue with v5.0.x on GHA so far. I tried locally with 128 and 256 MPI processes many times, and the test runs to completion without any issue. @hppritcha I'm using the following configure options; note the --enable-debug and --enable-mem-debug flags:
options=(
--prefix=/home/devel/mpi/openmpi/$branch
--without-ofi
--without-ucx
--without-psm2
--without-cuda
--without-rocm
--with-pmix=internal
--with-prrte=internal
--with-libevent=internal
--with-hwloc=internal
--enable-debug
--enable-mem-debug
) |
Yep, I'm using --enable-debug. I've not used --enable-mem-debug but I'll add that. |
BTW, just to reinforce my point that I don't believe this issue is mpi4py's fault, I ran 128 and 256 processes with v4.1.x (same configure options as above), and all seems to be good. |
Oops. The first call should be MPI_Irecv(); I will double-check that. This was supposed to be an MPI_Alltoall()-style exchange that can also be used to evidence the issue. |
I confirm, ob1 seems to be broken (independent of the BTL used). I altered @ggouaillardet's code above such that on deadlock one can do |
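The exact alteration is not shown above. A common way to get this kind of on-deadlock inspection (a hedged sketch of the general technique, not the actual change) is to have each rank print its PID and host and then spin on a volatile flag until a debugger attaches, e.g. with gdb -p <pid> followed by "set var hold = 0":
#include <stdio.h>
#include <unistd.h>
void wait_for_debugger(void)
{
    volatile int hold = 1;
    char host[256];
    gethostname(host, sizeof(host));
    printf("pid %d on %s waiting for debugger\n", (int)getpid(), host);
    fflush(stdout);
    while (hold)
        sleep(1);   /* from the attached debugger: set var hold = 0 */
}
Calling this early in main() makes it easy to attach to each rank before the exchange starts; for a run that is already hung, attaching gdb directly to the stuck PIDs works as well.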
In case it is useful, the first time I noticed the error was about 2 weeks ago while running commit cb00772. |
I don't think that particular commit can have any impact (none of our machines would be using the |
Definitely. My whole point was to speed up the bisection a bit. The faulty commit is hopefully not too much older than the one I reported above. |
FWIW, it seems some fixes were applied later, but that did not fix the issue we are discussing now. |
Here is the reproducer I am now using. I noted that if I uncomment the MPI_Barrier(MPI_COMM_WORLD) line, the problem goes away:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
int rank, size, lrank, lsize, rsize;
MPI_Comm local, inter;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_split(MPI_COMM_WORLD, (rank<size/2)?0:1, 0, &local);
MPI_Comm_rank(local, &lrank);
MPI_Comm_size(local, &lsize);
printf("I am %d: %d / %d\n", rank, lrank, lsize);
MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank<size/2)?size/2:0, 0, &inter);
// MPI_Barrier(MPI_COMM_WORLD);
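    // <-- uncommenting this barrier makes the problem go away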
MPI_Comm_remote_size(inter, &rsize);
size_t sbuf[size], rbuf[size];
MPI_Request reqs[size*2];
MPI_Request *req = reqs;
int i;
for (i=0; i<rsize; i++, req++) {
MPI_Irecv(rbuf+i, 1, MPI_COUNT, i, 0, inter, req);
}
for (i=0; i<rsize; i++, req++) {
MPI_Isend(sbuf+i, 1, MPI_COUNT, i, 0, inter, req);
}
MPI_Waitall(rsize*2, reqs, MPI_STATUSES_IGNORE);
MPI_Comm_free(&inter);
MPI_Comm_free(&local);
MPI_Finalize();
return 0;
} |
The difference between now and before is that there is no more allreduce on intercommunicators.
diff --git a/ompi/communicator/comm_cid.c b/ompi/communicator/comm_cid.c
index be022b8..fa427c6 100644
--- a/ompi/communicator/comm_cid.c
+++ b/ompi/communicator/comm_cid.c
@@ -917,14 +917,14 @@ int ompi_comm_activate_nb (ompi_communicator_t **newcomm, ompi_communicator_t *c
*/
ret = context->iallreduce_fn (&context->local_peers, &context->max_local_peers, 1, MPI_MAX, context,
&subreq);
- if (OMPI_SUCCESS != ret) {
- ompi_comm_request_return (request);
- return ret;
- }
- ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
} else {
- ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, NULL, 0);
+ ret = context->iallreduce_fn (NULL, NULL, 0, MPI_MAX, context, &subreq);
+ }
+ if (OMPI_SUCCESS != ret) {
+ ompi_comm_request_return (request);
+ return ret;
}
+ ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);
ompi_comm_request_start (request);
diff --git a/ompi/mca/coll/libnbc/nbc_ireduce.c b/ompi/mca/coll/libnbc/nbc_ireduce.c
index 1f7464e..3052f2a 100644
--- a/ompi/mca/coll/libnbc/nbc_ireduce.c
+++ b/ompi/mca/coll/libnbc/nbc_ireduce.c
@@ -67,7 +67,7 @@ static int nbc_reduce_init(const void* sendbuf, void* recvbuf, int count, MPI_Da
MPI_Aint ext;
NBC_Schedule *schedule;
char *redbuf=NULL, inplace;
- void *tmpbuf;
+ void *tmpbuf = NULL;
char tmpredbuf = 0;
enum { NBC_RED_BINOMIAL, NBC_RED_CHAIN, NBC_RED_REDSCAT_GATHER} alg;
ompi_coll_libnbc_module_t *libnbc_module = (ompi_coll_libnbc_module_t*) module;
@@ -126,25 +126,27 @@ static int nbc_reduce_init(const void* sendbuf, void* recvbuf, int count, MPI_Da
}
/* allocate temporary buffers */
- if (alg == NBC_RED_REDSCAT_GATHER || alg == NBC_RED_BINOMIAL) {
- if (rank == root) {
- /* root reduces in receive buffer */
- tmpbuf = malloc(span);
- redbuf = recvbuf;
+ if (0 < span) {
+ if (alg == NBC_RED_REDSCAT_GATHER || alg == NBC_RED_BINOMIAL) {
+ if (rank == root) {
+ /* root reduces in receive buffer */
+ tmpbuf = malloc(span);
+ redbuf = recvbuf;
+ } else {
+ /* recvbuf may not be valid on non-root nodes */
+ ptrdiff_t span_align = OPAL_ALIGN(span, datatype->super.align, ptrdiff_t);
+ tmpbuf = malloc(span_align + span);
+ redbuf = (char *)span_align - gap;
+ tmpredbuf = 1;
+ }
} else {
- /* recvbuf may not be valid on non-root nodes */
- ptrdiff_t span_align = OPAL_ALIGN(span, datatype->super.align, ptrdiff_t);
- tmpbuf = malloc(span_align + span);
- redbuf = (char *)span_align - gap;
- tmpredbuf = 1;
+ tmpbuf = malloc (span);
+ segsize = 16384/2;
}
- } else {
- tmpbuf = malloc (span);
- segsize = 16384/2;
- }
- if (OPAL_UNLIKELY(NULL == tmpbuf)) {
- return OMPI_ERR_OUT_OF_RESOURCE;
+ if (OPAL_UNLIKELY(0 < span && NULL == tmpbuf)) {
+ return OMPI_ERR_OUT_OF_RESOURCE;
+ }
}
#ifdef NBC_CACHE_SCHEDULE |
@ggouaillardet Oh boy... this is the damn comm_cid stuff that has been causing grief elsewhere. Maybe that's the root of the problem: without the barrier, the whole thing is broken for intercomms. BTW, maybe a better fix would be to actually use an ibarrier? |
Let's not get ahead of the solution here. The comm_cid selection algorithm is one of the most critical pieces of communicator management; it has hundreds of thousands of tests run on it every night and it performs exactly how it was designed. What the last reduce (which in the case of the intercomm would give an answer we would not be able to use anyway) provides is an extra synchronization preventing processes from sending messages too early. However, since this reduce is applied on the parent communicator, it does indeed behave as a barrier for the creation of the new intercomm, and as such it delays the arrival of messages until the local parts of the communicators are up and running (this is also confirmed by the fact that your error was related to the frags_cant_match list). We had protection against this, but for some reason it does not seem to work correctly. |
I was finally able to reproduce (sometimes) the mpi4py alltoall hang (although in a different place, test_cco_obj_inter) by setting up a VM with 4 cores and running oversubscribed by 1. |
If I revert commit 23df181 I'm not seeing a hang when running |
I am able to reproduce regularly with @ggouaillardet's test case and am looking into this. |
Okay, I figured out the problem here. There's either a design flaw in the way CIDs are being allocated, or else we really do need a barrier-like operation on the parent communicator used to create the new intercomm. What I determined is that here in the OB1 PML, a bogus communicator is being returned (because the parent comm was put in as a placeholder in the array of communicator handles). Here's the code snippet in |
The problem is the ompi_comm_lookup. The way the vector of communicator handles is being used in the cid alloc algorithm, the parent communicator is used as a "placeholder" in this array to help deal with the case of multiple threads trying to get a cid. BUT this communicator is not valid for use with the ctx in the hdr pkt. So rather than the pkt being routed on to the non_existing_communicator_pending list, it gets put on the wrong mca_pml_ob1_comm_proc_t struct. Now, in comm_cid.c in the 4.1.x branch, here's the original code we had that related to this: |
But now in main and 5.0.x it looks like this: |
It's unfortunate that the useful comment about the barrier was removed. It would have saved a week or so of debugging. |
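To make the failure mode described above concrete, here is a toy model in plain C. It is illustration only: the names and data structures are invented and far simpler than the real ob1/comm_cid code. It shows that if the table slot reserved for a not-yet-activated cid temporarily holds a pointer to the parent communicator, a lookup for an early-arriving fragment appears to succeed, and the fragment gets matched against the wrong communicator instead of being parked on a pending list.
#include <stdio.h>
#define MAX_COMMS 8
typedef struct toy_comm {
    int cid;
    const char *name;
} toy_comm;
static toy_comm *comm_table[MAX_COMMS];   /* index == cid */
static toy_comm *toy_lookup(int cid) {
    return (cid >= 0 && cid < MAX_COMMS) ? comm_table[cid] : NULL;
}
int main(void) {
    toy_comm parent = { 0, "parent" };
    comm_table[0] = &parent;
    /* cid allocation in progress: slot 1 is reserved for the new
       intercommunicator, but the parent pointer is used as the
       placeholder instead of a recognizable sentinel */
    comm_table[1] = &parent;
    /* an early fragment arrives carrying ctx/cid == 1 */
    toy_comm *c = toy_lookup(1);
    if (c == NULL)
        printf("cid 1 unknown: park the fragment on a pending list (correct)\n");
    else
        printf("cid 1 resolved to '%s': fragment matched against the wrong communicator (the bug)\n", c->name);
    return 0;
}
This is also why the fixes referenced below switch the placeholder to a dedicated sentinel (ompi_mpi_comm_null, later OMPI_COMM_SENTINEL) that the lookup can recognize.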
@hppritcha A minor correction - the barrier still exists on v5.0.x - it's only removed in main. |
Ah, good point @wenduwan. |
The existence of the code protecting against early arrival of messages proves that I am correct in my assessment that a barrier is, or at least was at some point, not necessary. Indeed, if the barrier were necessary there would be no need for that protection. Going back to your analysis that we put the parent communicator in the |
A barrier was removed from the cid allocation algorithm for intercommunicators, which leads to random hangs when doing MPI_Intercomm_create or its moral equivalent. See issue open-mpi#12367. The cid allocation algorithm and comm-from-cid lookup were modified to use a pointer to ompi_mpi_comm_null as the placeholder in the list of pointers to ompi communicators. A convenience debug function was added to comm_cid.c to dump out this table of communicators. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
A barrier was removed from the cid allocation algorithm for intercommunicators, which leads to random hangs when doing MPI_Intercomm_create or its moral equivalent. See issue open-mpi#12367. The cid allocation algorithm and comm-from-cid lookup were modified to use OMPI_COMM_SENTINEL as the placeholder in the list of pointers to ompi communicators. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Guys, may I gently request adding the barrier back until a definitive and more performant solution is found? Otherwise, this issue keeps giving trouble when testing unrelated developments like #12384. |
Fixed via #12465 |
Thanks, @hppritcha !!! |
A barrier was removed from the cid allocation algorithm for intercommunicators, which leads to random hangs when doing MPI_Intercomm_create or its moral equivalent. See issue open-mpi#12367. The cid allocation algorithm and comm-from-cid lookup were modified to use OMPI_COMM_SENTINEL as the placeholder in the list of pointers to ompi communicators. Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 15a3d26)
https://github.com/mpi4py/mpi4py-testing/actions/runs/8034284372/job/21945704643
BTW, I've submitted #12357 to re-enable multi-proc runs of the mpi4py tests. I've just rebased that PR, but now the tests may fail in the same way I'm reporting here.