Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

mtl: add query method to mtl components #409

Merged
merged 1 commit into from
Jul 30, 2015

Conversation

hppritcha
Copy link
Member

====================Forgotten PR needed for 1.10==========================

Switch to using the query/priority method for selecting
MTLs. This switch was motivated by the fact that now
on some platforms, its possible for multiple MTLs to
be initializable, but only one MTL should be selected.

In addition, there is a complication with the PSM and
IFO (with PSM provider) MTLs owing to the fact that
they cannot both intialize the underlying PSM context,
i.e. only one call to psm_init is allowed per process.

The mxm component has not been compiled as the author
doesn't currently have access to a system with a recent
enough mxm installed to allow for a compile.

The portals4, ofi, and psm components have been checked
for compilation. The ofi and psm components have been
checked for runtime correctness on a intel/qlogic system
with up to date PSM installed.

Bunch of impacted MTL authors:

@bureddy @regrant @adrianreber

Fixes open-mpi/ompi#735

Signed-off-by: Howard Pritchard howardp@lanl.gov

@hppritcha hppritcha added the bug label Jul 22, 2015
@hppritcha hppritcha added this to the v1.10.0 milestone Jul 22, 2015
@jsquyres
Copy link
Member

@hppritcha Ran into some compile-time errors in the MXM component:

18:48:24 mtl_mxm_component.c: In function 'ompi_mtl_mxm_component_register':
18:48:24 mtl_mxm_component.c:94: error: 'MCA_COMPILETIME_VER' undeclared (first use in this function)
18:48:24 mtl_mxm_component.c:94: error: (Each undeclared identifier is reported only once
18:48:24 mtl_mxm_component.c:94: error: for each function it appears in.)
18:48:24 mtl_mxm_component.c:96: error: 'MCA_BASE_VAR_TYPE_VERSION_STRING' undeclared (first use in this function)
18:48:24 mtl_mxm_component.c:113: error: 'MCA_RUNTIME_VER' undeclared (first use in this function)
18:48:24 mtl_mxm_component.c: In function 'ompi_mtl_mxm_component_query':
18:48:24 mtl_mxm_component.c:262: warning: assignment from incompatible pointer type
18:48:24 mtl_mxm.c: In function 'ompi_mtl_mxm_del_procs':
18:48:24 mtl_mxm.c:516: warning: comparison between signed and unsigned integer expressions

@hppritcha
Copy link
Member Author

botched the cross reference to commit, updating.

@hppritcha hppritcha force-pushed the topic/add_query_method_to_mtls branch from 041aa65 to 3f8e41d Compare July 23, 2015 16:53
@PHHargrove
Copy link
Member

Cannot compile mtl:ofi with this PR because ompi/mca/mtl/ofi/mtl_ofi_component.c contains conflict markers!!

@@ -87,11 +87,19 @@ ompi_mtl_ofi_component_open(void)
return OMPI_SUCCESS;
}

<<<<<<< HEAD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these should be removed.

@PHHargrove
Copy link
Member

A build w/ portals4 (but without ofi) shows not only the swapped initializers, but also a missing cast:

/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-icc-14/openmpi-1.10.0rc2/ompi/mca/mtl/portals4/mtl_portals4_component.c(56): warning #144: a value of type "int (*)(mca_base_module_t **, int *)" cannot be used to initialize an entity of type "mca_base_close_component_1_0_0_fn_t"
          ompi_mtl_portals4_component_query,  /* component close */
          ^

/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-icc-14/openmpi-1.10.0rc2/ompi/mca/mtl/portals4/mtl_portals4_component.c(57): warning #144: a value of type "int (*)(void)" cannot be used to initialize an entity of type "mca_base_query_component_2_0_0_fn_t"
          ompi_mtl_portals4_component_close, /* component close */
          ^

/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.10.0rc2-linux-x86_64-icc-14/openmpi-1.10.0rc2/ompi/mca/mtl/portals4/mtl_portals4_component.c(232): warning #556: a value of type "mca_mtl_base_module_t *" cannot be assigned to an entity of type "mca_base_module_t *"
      *module = &ompi_mtl_portals4.base;
              ^

By inspection I suspect ompi_mtl_psm_component_query() could have the same issue (but am not sure about types of base vs super), and that the unresolved conflict in ompi_mtl_ofi_component_query() needs to be resolved in favor of the HEAD (which has the cast).

Switch to using the query/priority method for selecting
MTLs.  This switch was motivated by the fact that now
on some platforms, its possible for multiple MTLs to
be initializable, but only one MTL should be selected.

In addition, there is a complication with the PSM and
IFO (with PSM provider) MTLs owing to the fact that
they cannot both intialize the underlying PSM context,
i.e. only one call to psm_init is allowed per process.

The mxm component has not been compiled as the author
doesn't currently have access to a system with a recent
enough mxm installed to allow for a compile.

The portals4, ofi, and psm components have been checked
for compilation.  The ofi and psm components have been
checked for runtime correctness on a intel/qlogic system
with up to date PSM installed.

Manually patch mxm, removing part of commit open-mpi/ompi@a4990de0

Fixes open-mpi/ompi#735

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha hppritcha force-pushed the topic/add_query_method_to_mtls branch from 3f8e41d to 6daef31 Compare July 23, 2015 22:13
@hppritcha
Copy link
Member Author

okay pushing fixes up. Some of these - like the portals4 problem - was actually fixed in another pr in master. Didn' realize I was not compiling the ofi mtl on my qlogic box. That's fixed. But the machine is very busy and I can't get a session to test. I'll push new changes.

@yburette could you test on one of your systems?

@hppritcha
Copy link
Member Author

@yburette found a bug in the ofi mtl which will need to be fixed to avoid the segfault @PHHargrove is seeing. I'll open a PR shortly on master for you to review.

@yburette
Copy link
Member

@hppritcha I finally got some True Scale machines to try this PR. It is working for me. The PSM MTL is being selected automatically.

@PHHargrove I also tried without any parameter (apart from -n and -H) on the command line and it seems to work.

@rhc54
Copy link

rhc54 commented Jul 24, 2015

@yburette The key to recreating the problem is that you have to get an MTL to build, but not have it be selected at runtime - without specifying the ^mtl param so it will be opened and then closed. Otherwise, it will always be selected and the problem won't occur.

@hppritcha
Copy link
Member Author

the truescale machine here is less busy today so I can run stuff. Ralph is right - for 1.10. You have to force selection of mtl to be ofi, but because of the not-so-great things going on in the cm select method, you end up using the ob1 pml. On master, the not-so-great things going on in the cm select method allow things to work -depending on your compiler.

Any rate, @yburette please review open-mpi/ompi#747.

@yburette
Copy link
Member

@rhc54, I agree. I forgot to mention that both the PSM and OFI MTLs were built.

[f011:125594] mca: base: components_open: component psm open function successful
[f011:125594] mca: base: components_open: found loaded component ofi
[f011:125594] mca: base: components_open: component ofi open function successful
[f011:125594] mca:base:select: Auto-selecting mtl components
[f011:125594] mca:base:select:(  mtl) Querying component [psm]
[f011:125594] mca:base:select:(  mtl) Query of component [psm] set priority to 100
[f011:125594] mca:base:select:(  mtl) Querying component [ofi]
[f011:125594] mca:base:select:(  mtl) Query of component [ofi] set priority to 10
[f011:125594] mca:base:select:(  mtl) Selected component [psm]
[f011:125594] mca: base: close: component ofi closed
[f011:125594] mca: base: close: unloading component ofi
[f011:125594] select: initializing mtl component psm

Am I missing anything?

@hppritcha
Copy link
Member Author

@yburette
what is your mpirun command line look like?

@yburette
Copy link
Member

mpirun -host f010,f011 -n 2 ./test

@hppritcha
Copy link
Member Author

please change to

mpirun -host f010,f011 -n 2 --mca mtl ofi ./test

@hppritcha
Copy link
Member Author

i'm assuming your working with the 1.10 release candidate.

@yburette
Copy link
Member

I'm working with this PR.

When I add --mca mtl ofi (but not --mca pml cm), I get the following error:

--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'f011', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node f011 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Is this Seg fault related to the problem with inlining you mention in open-mpi/ompi/pull/747?

@yburette
Copy link
Member

Finally read the mailing list :-)

I added the call to opal_progress_unregister() in the OFI MTL's close() function and I don't see the Seg fault anymore:

@@ -98,7 +98,18 @@ ompi_mtl_ofi_component_query(mca_base_module_t **module, int *priority)
 static int
 ompi_mtl_ofi_component_close(void)
 {
-    return OMPI_SUCCESS;
+    int ret;
+
+    /**
+     * Unregister progress callback.
+     */
+    ret = opal_progress_unregister(ompi_mtl_ofi_progress);
+    if (OMPI_SUCCESS != ret) {
+        opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
+                            "%s:%d: opal_progress_unregister failed: %d\n",
+                            __FILE__, __LINE__, ret);
+    }
+    return ret;
 }

@PHHargrove
Copy link
Member

@yburette
Is the patch in the previous comment in git anywhere?
I need something I can script patching from, and the issue tracker doesn't cut it.

@hppritcha
Copy link
Member Author

@yburette the segfault you are reporting above appears to be something new.

@jsquyres
Copy link
Member

@yburette Does this address the potential inlining of the progress function (Howard mentioned he saw problems w.r.t. the inlining -- but it may have been related to not unregistering during close...?).

@yburette
Copy link
Member

@PHHargrove yes, I have just pushed it to my fork

yburette/ompi-release@4dc4d15

Hope this helps

@yburette
Copy link
Member

@hppritcha @jsquyres
Ralph mentioned on the mailing list that not un-registering the progress callback could create a Seg Fault during Finalize.

This definitely seems to fix the issue I reported this morning.

Howard, what would be the best way for me to reproduce the issue w.r.t. inlining?

@hppritcha
Copy link
Member Author

yburette/ompi-release@4dc4d15 is equivalent to open-mpi/ompi#747 although the possible problem of passing an inline function to opal_progress_register may reoccur since the progress function is still marked as inlineable.

I've never observed the "There are more than one active ports on host..." message. I think that's something different, probably owing to the ofi/psm provider as well as the openib btl trying to initialize.

@PHHargrove
Copy link
Member

@hppritcha @yburette

Short version: Yohann's commit 4dc4d15 and open-mpi/ompi#747 are the same for me.

Longer version:

I previously reported on the mailing list this PR + open-mpi/ompi#747 worked when passing no args other than "-np 2" to mpirun.
As Howard observes, Yohann's commit 4dc4d15 is a subset of the changes in ompi#474.
So it should come as no surprise that this PR + yburette/ompi-release@4dc4d15 works just fine with no extra args.

However, I still see (just as reported on the mailing list with respect to 747) a SEGV if I run "mpirun -mca btl sm,self -np 1 examples/ring_c" unless I also add "-mca mtl ^ofi" or "-mca mtl portals4".

@PHHargrove
Copy link
Member

@yburette

The "There are more than one active ports on host..." message is something I've seen before on systems with multiple ports on the same IB network, and is the subject of an FAQ Entry. On systems I use that are connected this way, I set the MCA parameter described in that FAQ entry to disable the warning.

@tkordenbrock
Copy link
Member

@hppritcha

#409 applied, built and tested OK with mtl-portals4 in a single MTL build. All my setups are mtl-portals4 only, so I haven't been able to check if it segfaults if it is not selected.

I don't think mtl-portals4 is susceptible because progress is not registered until add_procs() which gets called after it is selected. I'll put together a setup with multiple MTLs to confirm.

@tkordenbrock
Copy link
Member

@hppritcha

#409 applied, built and tested OK with mtl-portals4 and mtl-ofi. I used this command:

$OMPI_110/bin/mpirun -n 2 -N 1 --mca pml cm --mca mtl_ofi_priority 100 --mca mtl_portals4_priority 50 ./sendrecv

@hppritcha
Copy link
Member Author

@tkordenbrock
thanks
@rhc54 I think this can be merged in now.

@rhc54
Copy link

rhc54 commented Jul 30, 2015

ok - thx!!

rhc54 pushed a commit that referenced this pull request Jul 30, 2015
@rhc54 rhc54 merged commit 671f995 into open-mpi:v1.10 Jul 30, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants