mtl: add query method to mtl components #409
@hppritcha Ran into some compile-time errors in the MXM component:
botched the cross reference to commit, updating. |
041aa65
to
3f8e41d
Compare
Cannot compile mtl:ofi with this PR because ompi/mca/mtl/ofi/mtl_ofi_component.c contains conflict markers!! |
@@ -87,11 +87,19 @@ ompi_mtl_ofi_component_open(void) | |||
return OMPI_SUCCESS; | |||
} | |||
|
|||
<<<<<<< HEAD |
Looks like these should be removed.
A build w/ portals4 (but without ofi) shows not only the swapped initializers, but also a missing cast:
By inspection I suspect |
Switch to using the query/priority method for selecting MTLs. This switch was motivated by the fact that, on some platforms, it is now possible for multiple MTLs to be initializable, but only one MTL should be selected. In addition, there is a complication with the PSM and OFI (with PSM provider) MTLs: they cannot both initialize the underlying PSM context, i.e. only one call to psm_init is allowed per process.
The mxm component has not been compiled, as the author doesn't currently have access to a system with a recent enough mxm installed to allow for a compile. The portals4, ofi, and psm components have been checked for compilation. The ofi and psm components have been checked for runtime correctness on an Intel/QLogic system with up-to-date PSM installed.
Manually patch mxm, removing part of commit open-mpi/ompi@a4990de0
Fixes open-mpi/ompi#735
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
3f8e41d
to
6daef31
Compare
Okay, pushing fixes up. Some of these - like the portals4 problem - were actually fixed in another PR in master. Didn't realize I was not compiling the ofi mtl on my qlogic box. That's fixed. But the machine is very busy and I can't get a session to test. I'll push new changes. @yburette could you test on one of your systems? |
@yburette found a bug in the ofi mtl which will need to be fixed to avoid the segfault @PHHargrove is seeing. I'll open a PR shortly on master for you to review. |
@hppritcha I finally got some True Scale machines to try this PR. It is working for me. The PSM MTL is being selected automatically. @PHHargrove I also tried without any parameter (apart from -n and -H) on the command line and it seems to work. |
@yburette The key to recreating the problem is that you have to get an MTL to build, but not have it be selected at runtime - without specifying the ^mtl param so it will be opened and then closed. Otherwise, it will always be selected and the problem won't occur. |
The truescale machine here is less busy today so I can run stuff. Ralph is right - for 1.10. You have to force selection of mtl to be ofi, but because of the not-so-great things going on in the cm select method, you end up using the ob1 pml. On master, the not-so-great things going on in the cm select method allow things to work - depending on your compiler. At any rate, @yburette please review open-mpi/ompi#747. |
@rhc54, I agree. I forgot to mention that both the PSM and OFI MTLs were built.
Am I missing anything? |
@yburette |
mpirun -host f010,f011 -n 2 ./test |
please change to mpirun -host f010,f011 -n 2 --mca mtl ofi ./test |
I'm assuming you're working with the 1.10 release candidate. |
I'm working with this PR. When I add --mca mtl ofi (but not --mca pml cm), I get the following error:
Is this Seg fault related to the problem with inlining you mention in open-mpi/ompi/pull/747? |
Finally read the mailing list :-) I added the call to opal_progress_unregister() in the OFI MTL's close() function and I don't see the Seg fault anymore:
@@ -98,7 +98,18 @@ ompi_mtl_ofi_component_query(mca_base_module_t **module, int *priority)
static int
ompi_mtl_ofi_component_close(void)
{
- return OMPI_SUCCESS;
+ int ret;
+
+ /**
+ * Unregister progress callback.
+ */
+ ret = opal_progress_unregister(ompi_mtl_ofi_progress);
+ if (OMPI_SUCCESS != ret) {
+ opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
+ "%s:%d: opal_progress_unregister failed: %d\n",
+ __FILE__, __LINE__, ret);
+ }
+ return ret;
} |
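For readers following along, here is a minimal, self-contained sketch of why a component that registers a progress callback has to unregister it in close(). It does not use the real OPAL API: `progress_register`, `progress_unregister`, `progress_poll`, and the `toy_*` component are hypothetical stand-ins, and registering during open is an assumption made only to keep the example short.

```c
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef int (*progress_cb_t)(void);

/* Hypothetical stand-in for a progress engine's callback table. */
static progress_cb_t callbacks[MAX_CALLBACKS];
static size_t num_callbacks = 0;

/* Add a callback to the table; returns 0 on success, -1 if the table is full. */
static int progress_register(progress_cb_t cb)
{
    if (num_callbacks == MAX_CALLBACKS) {
        return -1;
    }
    callbacks[num_callbacks++] = cb;
    return 0;
}

/* Remove a callback from the table; returns 0 on success, -1 if not found. */
static int progress_unregister(progress_cb_t cb)
{
    for (size_t i = 0; i < num_callbacks; i++) {
        if (callbacks[i] == cb) {
            /* Fill the hole with the last entry and shrink the table. */
            callbacks[i] = callbacks[--num_callbacks];
            callbacks[num_callbacks] = NULL;
            return 0;
        }
    }
    return -1;
}

/* Call every registered callback and sum the reported event counts. */
static int progress_poll(void)
{
    int events = 0;
    for (size_t i = 0; i < num_callbacks; i++) {
        events += callbacks[i]();
    }
    return events;
}

/* Hypothetical component state that the callback depends on. */
static int component_is_open = 0;

/* A real progress function would touch component resources here; once the
 * component is closed those resources are gone, which is modeled by a
 * large negative return value. */
static int toy_progress(void)
{
    return component_is_open ? 1 : -1000;
}

/* Assumption for this sketch: the component registers during open. */
static int toy_open(void)
{
    component_is_open = 1;
    return progress_register(toy_progress);
}

/* Mirror of the fix in the diff above: unregister before reporting closed. */
static int toy_close(void)
{
    component_is_open = 0;
    return progress_unregister(toy_progress);
}
```

If `toy_close()` skipped the unregister call, the next `progress_poll()` would still invoke `toy_progress()` on a closed component, which is the toy analogue of a component being opened, not selected, closed, and then polled.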
@yburette |
@yburette the segfault you are reporting above appears to be something new. |
@yburette Does this address the potential inlining of the progress function (Howard mentioned he saw problems w.r.t. the inlining -- but it may have been related to not unregistering during close...?). |
@PHHargrove yes, I have just pushed it to my fork. Hope this helps |
@hppritcha @jsquyres This definitely seems to fix the issue I reported this morning. Howard, what would be the best way for me to reproduce the issue w.r.t. inlining? |
yburette/ompi-release@4dc4d15 is equivalent to open-mpi/ompi#747 although the possible problem of passing an inline function to opal_progress_register may reoccur since the progress function is still marked as inlineable. I've never observed the "There are more than one active ports on host..." message. I think that's something different, probably owing to the ofi/psm provider as well as the openib btl trying to initialize. |
Short version: Yohann's commit 4dc4d15 and open-mpi/ompi#747 are the same for me. Longer version: I previously reported on the mailing list this PR + open-mpi/ompi#747 worked when passing no args other than "-np 2" to mpirun. However, I still see (just as reported on the mailing list with respect to 747) a SEGV if I run "mpirun -mca btl sm,self -np 1 examples/ring_c" unless I also add "-mca mtl ^ofi" or "-mca mtl portals4". |
The "There are more than one active ports on host..." message is something I've seen before on systems with multiple ports on the same IB network, and is the subject of an FAQ Entry. On systems I use that are connected this way, I set the MCA parameter described in that FAQ entry to disable the warning. |
#409 applied, built and tested OK with mtl-portals4 in a single MTL build. All my setups are mtl-portals4 only, so I haven't been able to check if it segfaults if it is not selected. I don't think mtl-portals4 is susceptible because progress is not registered until add_procs() which gets called after it is selected. I'll put together a setup with multiple MTLs to confirm. |
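The reasoning in the comment above (progress is registered only in add_procs(), which runs after selection) can be illustrated with a small sketch. Everything here is hypothetical: the `lazy_*` names and the single-slot "table" are illustration only and are not the portals4 MTL's actual code.

```c
#include <stddef.h>

typedef int (*lazy_progress_cb_t)(void);

/* Single-slot stand-in for a progress engine's callback table. */
static lazy_progress_cb_t lazy_registered = NULL;

/* Placeholder progress function; reports no events. */
static int lazy_progress(void)
{
    return 0;
}

/* Open does not register anything, so an opened-but-unselected
 * component never has a callback in the table. */
static int lazy_open(void)
{
    return 0;
}

/* Only the selected component reaches this point, so registration
 * happens strictly after selection. */
static int lazy_add_procs(void)
{
    lazy_registered = lazy_progress;
    return 0;
}

/* Close clears the slot; this is a no-op if nothing was registered. */
static int lazy_close(void)
{
    lazy_registered = NULL;
    return 0;
}

/* Returns nonzero if a progress callback is currently registered. */
static int lazy_has_progress(void)
{
    return lazy_registered != NULL;
}
```

Under these assumptions, an open followed directly by a close never leaves a callback behind, which matches the expectation that a component registering this late would not hit the segfault when it is not selected.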
#409 applied, built and tested OK with mtl-portals4 and mtl-ofi. I used this command:
@tkordenbrock |
ok - thx!! |
mtl: add query method to mtl components
====================Forgotten PR needed for 1.10==========================
Switch to using the query/priority method for selecting
MTLs. This switch was motivated by the fact that, on
some platforms, it is now possible for multiple MTLs to
be initializable, but only one MTL should be selected.
In addition, there is a complication with the PSM and
OFI (with PSM provider) MTLs: they cannot both
initialize the underlying PSM context,
i.e. only one call to psm_init is allowed per process.
The mxm component has not been compiled, as the author
doesn't currently have access to a system with a recent
enough mxm installed to allow for a compile.
The portals4, ofi, and psm components have been checked
for compilation. The ofi and psm components have been
checked for runtime correctness on an Intel/QLogic system
with up-to-date PSM installed.
Bunch of impacted MTL authors:
@bureddy @regrant @adrianreber
Fixes open-mpi/ompi#735
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
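
As a rough illustration of the query/priority idea described above, the sketch below asks each candidate for a priority, keeps the highest-priority one that reports itself usable, and closes the rest. This is a simplified model, not the actual MCA selection code: the `toy_component_t` struct, the `toy_*` functions, and the priority values 10 and 30 are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical, simplified component descriptor (not the real MCA struct). */
typedef struct toy_component {
    const char *name;
    int (*query)(int *priority);  /* returns 0 and sets *priority if usable */
    int (*close)(void);
} toy_component_t;

/* Counts how many components were closed, for demonstration only. */
static int toy_close_calls = 0;

static int toy_close(void)
{
    toy_close_calls++;
    return 0;
}

/* Made-up priorities; a failing query models an unusable component. */
static int toy_a_query(int *priority) { *priority = 10; return 0; }
static int toy_b_query(int *priority) { *priority = 30; return 0; }
static int toy_c_query(int *priority) { (void)priority; return -1; }

static const toy_component_t toy_components[] = {
    { "toy_a", toy_a_query, toy_close },
    { "toy_b", toy_b_query, toy_close },
    { "toy_c", toy_c_query, toy_close },
};

/* Query every component, keep the usable one with the highest priority,
 * and close all the others so that only one stays active. */
static const toy_component_t *toy_select(const toy_component_t *comps, size_t n)
{
    const toy_component_t *best = NULL;
    int best_priority = -1;

    for (size_t i = 0; i < n; i++) {
        int priority = 0;
        if (comps[i].query(&priority) != 0) {
            continue;  /* component reported itself unusable */
        }
        if (priority > best_priority) {
            best = &comps[i];
            best_priority = priority;
        }
    }

    for (size_t i = 0; i < n; i++) {
        if (&comps[i] != best) {
            comps[i].close();
        }
    }
    return best;
}
```

With the made-up values here, `toy_select(toy_components, 3)` returns the `toy_b` entry and closes the other two. The closing step is where an unselected component has to release anything it set up earlier, which is the situation the OFI discussion in this thread revolves around.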