Skip to content

Conversation

@bosilca
Copy link
Member

@bosilca bosilca commented Sep 20, 2015

This patch fixes the issues identified by Gilles with the ibm tests (collectives and topologies). In addition it improves the memory usage of OMPI, a communicator without collective communications will never allocate the array of requests used to coordinate the basic collective algorithms.

This ticket is supposed to replace #790. @ggouaillardet and @hjelmn please review.

@ggouaillardet
Copy link
Contributor

@bosilca there is something odd with your PR (it contains 14 commits, and i think you only meant to include bosilca/ompi@71874d1)
i guess the root cause is you branched from your own master and not from open-mpi master

if you think one allocation per communicator no matter what, is an issue, then i am fine with this approach.
in this case, we should probably use the same approach on both v1.8, v1.10, v2.x and master.
on master, that means we should revert (parts of)
76204df
27e4389
and revert the related of the one-off commits on v1.8 and v1.10

@jsquyres
Copy link
Member

@bosilca It looks like you made this PR off your "master" branch instead of off a topic branch. As such, the PR is picking up any new commits that you put on your "master" branch.

I think you almost certainly want to use a topic branch -- you might need to close this PR, make a topic branch, put the commit(s) on it that you want, and make a new PR from that branch.

@bosilca bosilca force-pushed the topic/coll_requests branch 2 times, most recently from 2e8ec56 to 80cd46f Compare September 21, 2015 09:33
@ggouaillardet
Copy link
Contributor

from the name, this is a topic branch.
you can simply do

git remote update
git rebase -i <name of the open-mpi repository>/master
git push --force -u <name of bosilca repository> topic/coll_requests

@bosilca
Copy link
Member Author

bosilca commented Sep 21, 2015

Thanks guys, the issues has been meanwhile fixed.

@ggouaillardet
Copy link
Contributor

@bosilca i am still experiencing some crashes with the neighbor_allgather_self test from the ibm test suite.

i will not have time to fully investigate this before thursday, and at first glance i noticed
your commit is related to accessing the mcct_reqs of the mca_coll_base_comm_t object.
but in this test, ompi accesses the mccb_reqs of the coll/basic module.

@bosilca bosilca force-pushed the topic/coll_requests branch from 10f7cd1 to 3ddb709 Compare September 22, 2015 23:06
@bosilca
Copy link
Member Author

bosilca commented Sep 22, 2015

@ggouaillardet I just updated the basic module. Please try again.

@ggouaillardet
Copy link
Contributor

@bosilca it fails at runtime because mca_coll_basic_get_reqs is not defined
(but declared in ompi/mca/coll/basic/coll_basic.h)

@bosilca
Copy link
Member Author

bosilca commented Sep 23, 2015

@ggouaillardet the prototype was there but I forgot to push the actual implementation.

@ggouaillardet
Copy link
Contributor

@bosilca here is a patch for this PR

diff --git a/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c b/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c
index 1b323d6..edce3fd 100644
--- a/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c
+++ b/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c
@@ -44,13 +44,13 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
 {
     const mca_topo_base_comm_cart_2_2_0_t *cart = comm->c_topo->mtc.cart;
     const int rank = ompi_comm_rank (comm);
-    ompi_request_t **reqs;
+    ompi_request_t **reqs, **preqs;
     ptrdiff_t lb, extent;
     int rc = MPI_SUCCESS, dim, nreqs;

     ompi_datatype_get_extent(rdtype, &lb, &extent);

-    reqs = mca_coll_basic_get_reqs( (mca_coll_basic_module_t *) module, 4 * cart->ndims );
+    reqs = preqs = mca_coll_basic_get_reqs( (mca_coll_basic_module_t *) module, 4 * cart->ndims );
     /* The ordering is defined as -1 then +1 in each dimension in
      * order of dimension. */
     for (dim = 0, nreqs = 0 ; dim < cart->ndims ; ++dim) {
@@ -66,7 +66,7 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
             nreqs += 2;
             rc = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, srank,
                                     MCA_COLL_BASE_TAG_ALLGATHER,
-                                    comm, reqs++));
+                                    comm, preqs++));
             if (OMPI_SUCCESS != rc) break;

             /* remove cast from const when the pml layer is updated to take
@@ -74,7 +74,7 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
             rc = MCA_PML_CALL(isend((void *) sbuf, scount, sdtype, srank,
                                     MCA_COLL_BASE_TAG_ALLGATHER,
                                     MCA_PML_BASE_SEND_STANDARD,
-                                    comm, reqs++));
+                                    comm, preqs++));
             if (OMPI_SUCCESS != rc) break;
         }

@@ -84,14 +84,14 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
             nreqs += 2;
             rc = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, drank,
                                     MCA_COLL_BASE_TAG_ALLGATHER,
-                                    comm, reqs++));
+                                    comm, preqs++));
             if (OMPI_SUCCESS != rc) break;


             rc = MCA_PML_CALL(isend((void *) sbuf, scount, sdtype, drank,
                                     MCA_COLL_BASE_TAG_ALLGATHER,
                                     MCA_PML_BASE_SEND_STANDARD,
-                                    comm, reqs++));
+                                    comm, preqs++));
             if (OMPI_SUCCESS != rc) break;
         }

@ggouaillardet
Copy link
Contributor

@bosilca i found an other issue when releasing ompi_mpi_comm_world with OPAL_ENABLE_DEBUG
the inlined patch is more of a proof of concept than a real fix

diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
index a7f302b..f5531f2 100644
--- a/ompi/communicator/comm_init.c
+++ b/ompi/communicator/comm_init.c
@@ -263,7 +263,14 @@ int ompi_comm_finalize(void)
     }

     /* Shut down MPI_COMM_WORLD */
-    OBJ_DESTRUCT( &ompi_mpi_comm_world );
+#if OPAL_ENABLE_DEBUG
+    assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (&ompi_mpi_comm_world))->obj_magic_id);
+#endif
+    opal_obj_run_destructors((opal_object_t *)&ompi_mpi_comm_world);
+#if OPAL_ENABLE_DEBUG
+    OBJ_SET_MAGIC_ID((&ompi_mpi_comm_world), 0);
+#endif
+    OBJ_REMEMBER_FILE_AND_LINENO( &ompi_mpi_comm_world, __FILE__, __LINE__ );

     /* Shut down the parent communicator, if it exists */
     if( ompi_mpi_comm_parent != &ompi_mpi_comm_null.comm ) {

@jsquyres
Copy link
Member

@bosilca Once this PR is working, it is likely worthwhile to squash the "add the missing function" commit into whichever other commit it belongs.

@bosilca
Copy link
Member Author

bosilca commented Sep 24, 2015

I also found another issue where some of the base functions are used by the basic module. In this case the base functions will try to access the base array of requests, while only the basic one has been allocated. Once I have a patch, I will integrate everything (including reverting your previous patches) and submit the PR.

@bosilca bosilca force-pushed the topic/coll_requests branch from 0658d00 to 89fc3b3 Compare September 25, 2015 01:07
@bosilca
Copy link
Member Author

bosilca commented Sep 25, 2015

@ggouaillardet I added your first patch, but not the second. There was a hided problem with some leaked persistent requests that I fixed in my own patch.

I did not revert your 2 previous commits (76204df
27e4389). I will be traveling for the next 2-3 days, and I will not have the time to do it, so if you want to go ahead and revert them I will really appreciate it.

Otherwise, I think this PR is now in good shape, ready to be merged.

@bosilca bosilca force-pushed the topic/coll_requests branch 2 times, most recently from 4b7827b to 99dd3e1 Compare September 27, 2015 15:03
@ggouaillardet
Copy link
Contributor

@bosilca your commits plus my two partial reverts are available in my repo branch
https://github.com/ggouaillardet/ompi/tree/topic/coll_requests
you may likely want to squash my two commits since I am not sure the first one can work without the other

from a git point of view, you can do
git remote add ggouaillardet https://github.com/ggouaillardet/ompi.git
git remote update ggouaillardet
git log ggouaillardet/topic/coll_requests
git cherry-pick ...

@bosilca bosilca force-pushed the topic/coll_requests branch 2 times, most recently from 9b04731 to dc7a5b2 Compare September 28, 2015 15:57
@bosilca
Copy link
Member Author

bosilca commented Sep 28, 2015

@ggouaillardet everything has been merged here. I kept 2 separate commits for the reverts, to keep a clean track of what was happening.

bosilca and others added 5 commits October 8, 2015 12:00
module_data to hold one with the largest necessary size. This array is
only allocated when needed, and it is released upon communicator
destruction.
number of requests.

Remove unnecessary fields.We don't need these fields.
in the base).
Correctly deal with persistent requests (they must be always freed when
they are stored in the request array associated with the communicator).
Always use MPI_STATUS_IGNORE for single request waiting functions.
@bosilca bosilca force-pushed the topic/coll_requests branch from dc7a5b2 to e946c82 Compare October 8, 2015 16:00
@bosilca
Copy link
Member Author

bosilca commented Oct 8, 2015

@ggouaillardet please review. I would like to get it in asap.

@ggouaillardet
Copy link
Contributor

👍
sorry for the delay ...

bosilca added a commit that referenced this pull request Oct 9, 2015
This patch fixes the issues identified by @ggouaillardet in the IBM tests (collectives and topologies). It also improves the memory usage of OMPI, as a communicator without collective communications will never allocate the array of requests needed to coordinate the basic collective algorithms. This ticket replaced #790.
@bosilca bosilca merged commit 1310acc into open-mpi:master Oct 9, 2015
@bosilca bosilca deleted the topic/coll_requests branch October 9, 2015 15:27
jsquyres added a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants