@rhc54
Contributor

@rhc54 rhc54 commented Dec 19, 2015

No description provided.

@rhc54
Contributor Author

rhc54 commented Dec 20, 2015

@hppritcha Can you tell me something more about how your disable-dlopen is being configured? I cannot replicate this error, even with --disable-dlopen in the configure. I also cannot replicate a "make distcheck" error. Everything seems to be building just fine.

@hppritcha
Member

The disable-dlopen check is not doing anything special. It uses the GNU 5.2 compilers. It looks like mpicc is not pulling in the pmix lib.

The distcheck is a simple autogen.pl && make -j 4 distcheck
This one uses gnu 4.9, as I recall.

@hppritcha
Member

I notice something really weird in the install directory for the dlopen test. There are two libpmix libs: libpmix.so.2.0.0 and libpmix.so.0.0.0. The former is what libopen-pal.so is linked against, but it has the OPAL_PMIX_PMIX112 symbols, whereas libpmix.so.0.0.0 has the OPAL_PMIX_PMIX120 symbols defined in it. There must be some sort of automake issue with the way pmix is being installed and the way mpicc is linking. Do we really want two libpmix's being installed?

@hppritcha
Member

Hmmm... With --disable-dlopen and things the way they are, I don't think we would want to build multiple versions of libpmix with different symbol names. Perhaps renaming the libraries to something like libpmix_112.so.... and libpmix_120.so.... would enable a --disable-dlopen environment to work.

@ggouaillardet
Contributor

@rhc54
@hppritcha pointed out one issue (two libpmix.la libs)
I found another issue (a missing include)
these are fixed in #1244

there is now a crash at runtime; I think the root cause is that some names still conflict
for example, pmix_globals_init is defined in both libpmix112.la and libpmix.la

@rhc54
Contributor Author

rhc54 commented Dec 22, 2015

@miked-mellanox I'm getting an error in the oshmem area when I try to build with --enable-static --disable-shared:

/usr/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o): undefined reference to symbol '_end'
/usr/bin/ld: note: '_end' is defined in DSO /usr/lib64/libnl-route-3.so.200 so try adding it to the linker command line
/usr/lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make[2]: *** [oshmem_info] Error 1
make[1]: *** [all-recursive] Error 1

Not sure if you have tried this recently?

@mike-dubman
Member

  • hmm.... the "_end" symbol is used by memheap and comes from the built-in ptmalloc2 code.
  • it is now also available in libnl

@rhc54 - what distro are you using?

@rhc54
Contributor Author

rhc54 commented Dec 22, 2015

CentOS 7

@mike-dubman
Member

@igor-ivanov - could you please check. thanks

@igor-ivanov
Member

@miked-mellanox The _end symbol is used in memheap to check for valid segments related to the oshmem symmetric area. The same issue is reported here: http://www.open-mpi.org/community/lists/users/2015/11/28014.php
The direct fix is probably to link the memheap library with -lnl. Will check.
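
For readers following along, the kind of segment check mentioned here can be sketched in C. This is only an illustration and not the oshmem memheap code: the helper name `addr_in_static_image` is hypothetical, and the sketch assumes a GNU/Linux ELF toolchain whose linker provides the `_etext` and `_end` symbols (see the `end(3)` man page).

```c
#include <stdint.h>

/* Provided by the linker on typical ELF toolchains (see end(3)).
 * Only the addresses of these symbols are meaningful, not their contents. */
extern char _etext; /* first address past the end of the text segment */
extern char _end;   /* first address past the end of the bss segment  */

/* Hypothetical helper: returns 1 if p lies between the end of the text
 * segment and the end of bss (the program's static data), 0 otherwise. */
int addr_in_static_image(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)&_etext && a < (uintptr_t)&_end;
}
```

Because the check needs `_end` to resolve when the final executable is linked, a link line on which that symbol cannot be resolved fails in the way shown in the error output above.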

@ggouaillardet
Contributor

#1014 is a first step.
on centos7, the infiniband libraries depend on libnl
so far, the only issue I have met is that the reachable/netlink component tries to use libnl3 if the libnl3 devel headers are found.
in this case, #1014 would force reachable/netlink to use libnl instead of libnl3
fwiw, without #1014, ompi_info --all crashes on my centos 7 box with libnl3-devel

in this case, which libraries are using libnl and which are using libnl3?
if the infiniband libraries might depend on libnl3, oshmem should be updated so it can use libnl3

@hppritcha hppritcha mentioned this pull request Dec 23, 2015
@ggouaillardet
Contributor

on second thought, that might be a different issue
the _end symbol is not found and the linker is not happy about that.
I will dig a bit; should _end be declared as a weak symbol instead?
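
A weak declaration of the kind asked about here could look roughly like the following. This is a minimal sketch, assuming GCC or Clang on an ELF platform (`__attribute__((weak))` is a compiler extension); the helper names are hypothetical and this is not a patch against oshmem.

```c
#include <stddef.h>
#include <stdint.h>

/* Weak reference: if nothing provides _end at link time, the link still
 * succeeds and the address of the symbol evaluates to NULL at run time. */
extern char _end __attribute__((weak));

/* Hypothetical helper: 1 if the linker provided _end, 0 otherwise. */
int end_symbol_available(void)
{
    return &_end != NULL;
}

/* Hypothetical helper: 1 if p is below _end, 0 if it is not,
 * and -1 if _end is unavailable and the question cannot be answered. */
int below_end_or_unknown(const void *p)
{
    if (!end_symbol_available()) {
        return -1;
    }
    return (uintptr_t)p < (uintptr_t)&_end;
}
```

The trade-off is that a missing symbol turns from a link error into a run-time condition, so every caller has to handle the "unknown" case explicitly.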

@ggouaillardet
Contributor

@rhc54 I cannot reproduce this issue on my most up-to-date centos 7 vm
btw, I find it odd that /usr/bin/ld is involved, since you configure'd with --disable-shared --enable-static
can you post some more info on where the error occurs?

@rhc54
Contributor Author

rhc54 commented Dec 24, 2015

Not until next week as I'm away from that cluster. Will recheck upon return

@ggouaillardet
Contributor

@rhc54 when configure'd with --disable-dlopen, I got an error because pmix_munge_module is defined twice

I fixed that with this inline patch

diff --git a/opal/mca/pmix/pmix112/pmix/include/pmix/rename.h b/opal/mca/pmix/pmix112/pmix/include/pmix/rename.h
index f5ecc8f..7143865 100644
--- a/opal/mca/pmix/pmix112/pmix/include/pmix/rename.h
+++ b/opal/mca/pmix/pmix112/pmix/include/pmix/rename.h
@@ -319,6 +319,7 @@ BEGIN_C_DECLS
 #define pmix_list_sort                           PMIX_NAME(list_sort)
 #define pmix_list_splice                         PMIX_NAME(list_splice)
 #define pmix_list_t_class                        PMIX_NAME(list_t_class)
+#define pmix_munge_module                        PMIX_NAME(munge_module)
 #define pmix_native_module                       PMIX_NAME(native_module)
 #define pmix_notify_caddy_t_class                PMIX_NAME(notify_caddy_t_class)
 #define pmix_nrec_t_class                        PMIX_NAME(nrec_t_class)
diff --git a/opal/mca/pmix/pmix120/pmix/include/pmix/rename.h b/opal/mca/pmix/pmix120/pmix/include/pmix/rename.h
index f5ecc8f..7143865 100644
--- a/opal/mca/pmix/pmix120/pmix/include/pmix/rename.h
+++ b/opal/mca/pmix/pmix120/pmix/include/pmix/rename.h
@@ -319,6 +319,7 @@ BEGIN_C_DECLS
 #define pmix_list_sort                           PMIX_NAME(list_sort)
 #define pmix_list_splice                         PMIX_NAME(list_splice)
 #define pmix_list_t_class                        PMIX_NAME(list_t_class)
+#define pmix_munge_module                        PMIX_NAME(munge_module)
 #define pmix_native_module                       PMIX_NAME(native_module)
 #define pmix_notify_caddy_t_class                PMIX_NAME(notify_caddy_t_class)
 #define pmix_nrec_t_class                        PMIX_NAME(nrec_t_class)
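
The effect of a rename line like the one added above can be sketched in isolation. This is a minimal sketch, assuming a token-pasting macro similar in spirit to PMIX_NAME; the DEMO_* macros, the demo112_ prefix, and the helper function are hypothetical and are not taken from the pmix sources.

```c
/* Hypothetical per-version prefix; a second embedded copy of the library
 * would use a different one, e.g. demo120_. */
#define DEMO_PREFIX demo112_

/* Two-level concatenation so DEMO_PREFIX is expanded before pasting. */
#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b)  DEMO_CAT_(a, b)
#define DEMO_NAME(n)    DEMO_CAT(DEMO_PREFIX, n)

/* Two-level stringification so the argument is expanded first. */
#define DEMO_STR_(x) #x
#define DEMO_STR(x)  DEMO_STR_(x)

/* The rename line: every later use of demo_munge_module in this
 * translation unit refers to demo112_munge_module instead. */
#define demo_munge_module DEMO_NAME(munge_module)

/* This definition emits the symbol demo112_munge_module. */
int demo_munge_module = 112;

/* Returns the symbol name the compiler actually sees after renaming. */
const char *renamed_symbol(void)
{
    return DEMO_STR(demo_munge_module);
}
```

With a rename in place for each version, two copies that both define `demo_munge_module` in their sources end up exporting differently named symbols, so they can be linked into one image without a duplicate-definition error. A global that is left out of the rename list keeps its original name in both copies, which is the kind of clash reported above.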

…x API.

Update the configure logic for the new pmix120 component

Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works

Cleanup the rename files to use the pretty macros

Add Gilles change: missing rename
@rhc54
Contributor Author

rhc54 commented Dec 28, 2015

@ggouaillardet Thanks - added it

@ggouaillardet
Contributor

@rhc54 I made #1266, which includes your commits and fixes other misc issues

I can now build with and without --disable-dlopen

@rhc54
Contributor Author

rhc54 commented Dec 29, 2015

Instead of us bouncing back/forth on these PRs, which is really confusing, I'd be willing to grant you write permission on my branch if you only asked. Frankly, this going back and forth makes it very hard to follow what you are doing relative to changing what has been done.

I can live with it this time, but let's avoid this confusion in the future, okay?

@rhc54 rhc54 closed this Dec 29, 2015
@rhc54 rhc54 deleted the topic/pmix120 branch December 29, 2015 05:18
jsquyres added a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016