-
Notifications
You must be signed in to change notification settings - Fork 934
Topic/coll requests #912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Topic/coll requests #912
Conversation
0b306f2 to
71874d1
Compare
|
@bosilca there is something odd with your PR (it contains 14 commits, and i think you only meant to include bosilca/ompi@71874d1) if you think one allocation per communicator no matter what, is an issue, then i am fine with this approach. |
|
@bosilca It looks like you made this PR off your "master" branch instead of off a topic branch. As such, the PR is picking up any new commits that you put on your "master" branch. I think you almost certainly want to use a topic branch -- you might need to close this PR, make a topic branch, put the commit(s) on it that you want, and make a new PR from that branch. |
2e8ec56 to
80cd46f
Compare
|
from the name, this is a topic branch. |
|
Thanks guys, the issues has been meanwhile fixed. |
|
@bosilca i am still experiencing some crashes with the i will not have time to fully investigate this before thursday, and at first glance i noticed |
10f7cd1 to
3ddb709
Compare
|
@ggouaillardet I just updated the basic module. Please try again. |
|
@bosilca it fails at runtime because |
|
@ggouaillardet the prototype was there but I forgot to push the actual implementation. |
|
@bosilca here is a patch for this PR diff --git a/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c b/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c
index 1b323d6..edce3fd 100644
--- a/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c
+++ b/ompi/mca/coll/basic/coll_basic_neighbor_allgather.c
@@ -44,13 +44,13 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
{
const mca_topo_base_comm_cart_2_2_0_t *cart = comm->c_topo->mtc.cart;
const int rank = ompi_comm_rank (comm);
- ompi_request_t **reqs;
+ ompi_request_t **reqs, **preqs;
ptrdiff_t lb, extent;
int rc = MPI_SUCCESS, dim, nreqs;
ompi_datatype_get_extent(rdtype, &lb, &extent);
- reqs = mca_coll_basic_get_reqs( (mca_coll_basic_module_t *) module, 4 * cart->ndims );
+ reqs = preqs = mca_coll_basic_get_reqs( (mca_coll_basic_module_t *) module, 4 * cart->ndims );
/* The ordering is defined as -1 then +1 in each dimension in
* order of dimension. */
for (dim = 0, nreqs = 0 ; dim < cart->ndims ; ++dim) {
@@ -66,7 +66,7 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
nreqs += 2;
rc = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, srank,
MCA_COLL_BASE_TAG_ALLGATHER,
- comm, reqs++));
+ comm, preqs++));
if (OMPI_SUCCESS != rc) break;
/* remove cast from const when the pml layer is updated to take
@@ -74,7 +74,7 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
rc = MCA_PML_CALL(isend((void *) sbuf, scount, sdtype, srank,
MCA_COLL_BASE_TAG_ALLGATHER,
MCA_PML_BASE_SEND_STANDARD,
- comm, reqs++));
+ comm, preqs++));
if (OMPI_SUCCESS != rc) break;
}
@@ -84,14 +84,14 @@ mca_coll_basic_neighbor_allgather_cart(const void *sbuf, int scount,
nreqs += 2;
rc = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, drank,
MCA_COLL_BASE_TAG_ALLGATHER,
- comm, reqs++));
+ comm, preqs++));
if (OMPI_SUCCESS != rc) break;
rc = MCA_PML_CALL(isend((void *) sbuf, scount, sdtype, drank,
MCA_COLL_BASE_TAG_ALLGATHER,
MCA_PML_BASE_SEND_STANDARD,
- comm, reqs++));
+ comm, preqs++));
if (OMPI_SUCCESS != rc) break;
}
|
|
@bosilca i found an other issue when releasing ompi_mpi_comm_world with OPAL_ENABLE_DEBUG diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
index a7f302b..f5531f2 100644
--- a/ompi/communicator/comm_init.c
+++ b/ompi/communicator/comm_init.c
@@ -263,7 +263,14 @@ int ompi_comm_finalize(void)
}
/* Shut down MPI_COMM_WORLD */
- OBJ_DESTRUCT( &ompi_mpi_comm_world );
+#if OPAL_ENABLE_DEBUG
+ assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (&ompi_mpi_comm_world))->obj_magic_id);
+#endif
+ opal_obj_run_destructors((opal_object_t *)&ompi_mpi_comm_world);
+#if OPAL_ENABLE_DEBUG
+ OBJ_SET_MAGIC_ID((&ompi_mpi_comm_world), 0);
+#endif
+ OBJ_REMEMBER_FILE_AND_LINENO( &ompi_mpi_comm_world, __FILE__, __LINE__ );
/* Shut down the parent communicator, if it exists */
if( ompi_mpi_comm_parent != &ompi_mpi_comm_null.comm ) { |
|
@bosilca Once this PR is working, it is likely worthwhile to squash the "add the missing function" commit into whichever other commit it belongs. |
|
I also found another issue where some of the base functions are used by the basic module. In this case the base functions will try to access the base array of requests, while only the basic one has been allocated. Once I have a patch, I will integrate everything (including reverting your previous patches) and submit the PR. |
0658d00 to
89fc3b3
Compare
|
@ggouaillardet I added your first patch, but not the second. There was a hided problem with some leaked persistent requests that I fixed in my own patch. I did not revert your 2 previous commits (76204df Otherwise, I think this PR is now in good shape, ready to be merged. |
4b7827b to
99dd3e1
Compare
|
@bosilca your commits plus my two partial reverts are available in my repo branch from a git point of view, you can do |
9b04731 to
dc7a5b2
Compare
|
@ggouaillardet everything has been merged here. I kept 2 separate commits for the reverts, to keep a clean track of what was happening. |
module_data to hold one with the largest necessary size. This array is only allocated when needed, and it is released upon communicator destruction.
number of requests. Remove unnecessary fields.We don't need these fields.
in the base). Correctly deal with persistent requests (they must be always freed when they are stored in the request array associated with the communicator). Always use MPI_STATUS_IGNORE for single request waiting functions.
…h_create(...)" This partially reverts commit open-mpi/ompi@27e4389.
…s if the degree" This partially reverts commit open-mpi/ompi@76204df.
dc7a5b2 to
e946c82
Compare
|
@ggouaillardet please review. I would like to get it in asap. |
|
👍 |
This patch fixes the issues identified by @ggouaillardet in the IBM tests (collectives and topologies). It also improves the memory usage of OMPI, as a communicator without collective communications will never allocate the array of requests needed to coordinate the basic collective algorithms. This ticket replaced #790.
NEWS - add a blurb about issue 1225
This patch fixes the issues identified by Gilles with the ibm tests (collectives and topologies). In addition it improves the memory usage of OMPI, a communicator without collective communications will never allocate the array of requests used to coordinate the basic collective algorithms.
This ticket is supposed to replace #790. @ggouaillardet and @hjelmn please review.