Skip to content

Conversation

@ggouaillardet
Copy link
Contributor

and simplify the error handling in topo/base upon ompi_comm_enable
invokation

and simplify the error handling in topo/base upon ompi_comm_enable
invokation
@ggouaillardet
Copy link
Contributor Author

@bosilca could you please review this ?
this addresses the issue you reported at http://www.open-mpi.org/community/lists/devel/2015/03/17119.php

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/338/
Test PASSed.

@bosilca
Copy link
Member

bosilca commented Mar 10, 2015

You cannot rely on the OBJ_RELEASE to correctly destroy the communicator for 2 reasons:

  1. The reference count on the communicator is tricky, and in some instances one can have 2 refs (thus one should call ompi_communicator_free at least once).
  2. the destructor doesn't free all the MPI level objects (topology and friends). Instead it simply release the groups, call the PML del_comm and remove the communicator from the cid and F2C tables.

Moreover, I don't even see the OBJ_RELEASE in this patch, but the call to ompi_communicator_free went missing. What was the original issue you were trying to fix?

@ggouaillardet
Copy link
Contributor Author

the original issue, is that

  • if ompi_comm_enable fails in ompi_comm_nextcid or ompi_comm_fill_rest,
    then new_comm is left unchanged
  • if ompi_comm_enable fails in ompi_comm_activate, then new_comm is OBJ_RELEASE'd
    (OBJ_RELEASE occurs in ompi_comm_activate)

at least in mca_topo_base_cart_create, reference_count of new_comm is always 1
(i checked this with a debugger), so if ompi_comm_activate fails, new_comm is released

back to mca_topo_base_cart_create, if ompi_comm_enable fails, new_comm might have been released or not ... so invoking ompi_comm_free on new_comm might crash.

communicator destructor does OBJ_RELEASE comm->c_topo if it is not NULL
( i am not sure of what you meant by "opology and friends")

@hppritcha
Copy link
Member

Where are we with this PR? Should it be closed without merge?

@jsquyres
Copy link
Member

@bosilca @ggouaillardet How's it going on this PR? :)

@hppritcha
Copy link
Member

Can one of the admins verify this patch?

@bosilca
Copy link
Member

bosilca commented May 11, 2015

I looked at the topo_base_cart_create.c, and it seems that this patch introduced some memory leaks. To be more precise with this patch the newly created communicator (new_comm) and the newly created cart and topo are not released.

@jsquyres
Copy link
Member

@ggouaillardet Can you address @bosilca's comments (#459 (comment))?

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
…ob_tcp_finalize

Revert "oob/tcp: fix a race condition when finalizing the oob/tcp com…
@rhc54
Copy link
Contributor

rhc54 commented Jan 26, 2016

@ggouaillardet Can we bring this to conclusion? Been hanging around for a long time.

@ggouaillardet
Copy link
Contributor Author

let s close this PR
that was just a corner case in the error path anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants