v1.10 - add extra mpi barrier in finalize flow #632
Conversation
To make sure no rank calls mxm_ep_disconnect before the others have finished their MPI communications. This could be removed after refactoring the MXM disconnect flow.
:bot:milestone:v1.10.1
Test PASSed.
Actually, this commit changed the MPI_Finalize behavior and introduced an issue in MXM: before this commit, ompi_comm_finalize() was called before del_procs(), which ensured a barrier before calling del_procs().
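A small standalone example makes the constraint concrete. This is a toy sketch, not OMPI-internal code: ep_disconnect() is a hypothetical stand-in for mxm_ep_disconnect(), and the MPI_Barrier plays the role that the synchronization before del_procs() used to provide, so no rank tears down its endpoints while a peer is still communicating.

```c
/* barrier_before_teardown.c - toy illustration, not OMPI-internal code.
 * ep_disconnect() is a hypothetical stand-in for mxm_ep_disconnect().
 * Assumed build/run: mpicc barrier_before_teardown.c -o toy && mpirun -n 2 ./toy */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

static void ep_disconnect(int rank)
{
    printf("rank %d: tearing down transport endpoints (del_procs)\n", rank);
}

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            sleep(1);                                   /* rank 0 is "slow" */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    /* Without this barrier, a fast rank could reach ep_disconnect() while a
     * slow peer is still inside an MPI/MXM operation; with it, every rank is
     * known to be done communicating before anyone disconnects. */
    MPI_Barrier(MPI_COMM_WORLD);
    ep_disconnect(rank);

    MPI_Finalize();
    return 0;
}
```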
It doesn't seem like a good idea to unconditionally have an extra barrier in finalize, especially for systems that don't require it. Can this barrier be made conditional, somehow? E.g., systems that require it could ask for it (e.g., via a PML, BTL, and/or MTL flag that indicates that the underlying point-to-point system needs it).
Actually, I looked closer at master: it's a PMIx fence, and it's not conditional -- but it's OOB, not MPI. Can you explain further why this patch is necessary? Doesn't the RTE barrier prevent MXM endpoints from disconnecting from each other too early? FWIW: @rhc54 and I talked about this patch, and we came up with a different theory as to why you need this patch... but after reading your commit message, I'm not sure it's correct (it had to do with direct launch and how the SLURM PMI fence is blocking)...
The MXM disconnect flow requires that all processes doing the disconnect are waiting in the "mxm disconnect" flow, so an OOB or MPI barrier makes that happen. Before commit c66ebf6 we had a barrier in place. Maybe we can have a global flag which is set by any component that requires a barrier in finalize, and MXM would set it? @yosefe - what do you think?
@miked-mellanox As I mentioned, there's already an RTE barrier in the v1.10 finalize.
Are you referring to this one? https://github.com/open-mpi/ompi-release/blob/v1.10/ompi/runtime/ompi_mpi_finalize.c#L223 If so, it is too late, as it happens after del_procs(): https://github.com/open-mpi/ompi-release/blob/v1.10/ompi/runtime/ompi_mpi_finalize.c#L144
Ah, ok -- you didn't say that before (that it needed to be before del_procs). Then yes, a global flag indicating "I need an RTE barrier before del_procs in ompi_mpi_finalize" (that defaults to false) would probably be fine. This is basically what I suggested in #632 (comment). Just curious: why is this only coming up a year and a half after c66ebf6 was committed?
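For concreteness, here is a minimal, self-contained sketch of that flag idea. Every name in it (need_barrier_before_del_procs, mxm_like_component_init, and the stub functions) is hypothetical; it only illustrates a default-false flag that an MXM-based component would set and that the finalize path would check before del_procs.

```c
/* flag_sketch.c - self-contained sketch of the "conditional barrier" idea.
 * All names are hypothetical; this is not OMPI code. */
#include <stdbool.h>
#include <stdio.h>

/* Defaults to false, so transports that don't need the barrier are unaffected. */
static bool need_barrier_before_del_procs = false;

/* A component that requires the barrier would set the flag during its init. */
static void mxm_like_component_init(void)
{
    need_barrier_before_del_procs = true;
}

static void rte_barrier(void) { puts("barrier: all ranks synchronized"); }
static void del_procs(void)   { puts("del_procs: transport endpoints torn down"); }

/* Stands in for the relevant part of the finalize path. */
static void finalize(void)
{
    if (need_barrier_before_del_procs) {
        rte_barrier();   /* no rank reaches endpoint disconnect early */
    }
    del_procs();
}

int main(void)
{
    mxm_like_component_init();
    finalize();
    return 0;
}
```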
Actually, a better question is: can the RTE barrier be moved up in the v1.10 ompi_mpi_finalize() to be before the del_procs, like it is in master? (I have to run at the moment and can't check this myself) If so, this whole conditional thing isn't necessary (although we could probably still do the conditional thing as an optimization for the transports that don't need it -- but we should probably do that on master first).
Please see this: http://www.open-mpi.org/community/lists/devel/2014/07/15203.php for why-now.
Moving the RTE barrier to before del_procs can work for us; it was there before.
If you look carefully at the referenced commit, it is a revert of a prior commit from @hjelmn that had moved the barrier before del_procs. This was part of a broader finalize change. The reason it was reverted was a long telecon discussion prompted by a spew of errors in the 1.8 series MTT that resulted from the change. So I reverted it as requested, and we scheduled the matter for discussion at the upcoming OMPI devel conference in June of 2014. The result of all that was a conclusion that the finalize change proposed by @hjelmn wasn't immediately backwards compatible, as it relied on a number of other things, and so we left the 1.8 series alone. So I remain puzzled as to why this has been in the release series since the 1.7 days, was briefly modified as you now request, and shortly thereafter reverted - and yet only now is being raised again. From the messages of 15 months ago, it looks like this is an intermittent issue at most - has it not been a concern because it is rarely hit? Or, as was suggested back then, is it only seen when someone's program isn't quite correct?
I don't remember all the details :(
And apparently I don't remember all the details, either (since I just suggested moving the RTE barrier up before del_procs, which is apparently bad!). Doh! Trying to remember a bit more: I think that back then, it wasn't so much an issue of moving the RTE barrier up as it was moving the del_procs down... wasn't it? I.e., wasn't it moving the location of the del_procs that caused the problem (that was then reverted)? Armed with this (possibly incorrect) information...
Thoughts?
I'd love to see this fixed upstream on master.
@gpaulsen What's broken on master? (note that the code is quite different over there because master has PMIx)
We reviewed the v1.10 finalize code, and we can probably add a barrier call into del_procs() for pml/yalla and mtl/mxm.
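A rough, self-contained sketch of that direction (hypothetical names, not the actual patch): the synchronization lives inside the MXM-based component's own del_procs, so transports that don't need it are untouched.

```c
/* del_procs_sketch.c - hypothetical sketch of a barrier inside the component's
 * own del_procs (pml/yalla or mtl/mxm). Names and helpers are illustrative. */
#include <stdio.h>
#include <stddef.h>

static void world_barrier(void)
{
    puts("barrier: wait until every rank has reached endpoint teardown");
}

static void endpoint_disconnect(size_t peer)
{
    /* Would wrap mxm_ep_disconnect() in the real component. */
    printf("disconnect endpoint for peer %zu\n", peer);
}

/* Stands in for the component's del_procs callback. */
static int mxm_like_del_procs(size_t nprocs)
{
    world_barrier();   /* peers are guaranteed to be done with MPI traffic */
    for (size_t i = 0; i < nprocs; ++i) {
        endpoint_disconnect(i);
    }
    return 0;
}

int main(void)
{
    return mxm_like_del_procs(4);
}
```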
@miked-mellanox Ok, that seems like a non-controversial fix. Thanks!
Based on open-mpi#632
* limitation: does not support MPI_Spawn
Closing this PR, replacing it with this one: #644
Initialize convertor in pml-cm-send/recv
master commit: open-mpi/ompi#973