Skip to content

configury: check libnl version and warn in case of conflict #1014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 25, 2016

Conversation

ggouaillardet
Copy link
Contributor

libnl and libnl-3 are known to conflict with each other, so warn
if these two libs are used by different third party libs

return 0;
}
EOF
OPAL_LOG_COMMAND([$CC -o conftest $CFLAGS $CPPFLAGS conftest_c.$ac_ext $LDFLAGS -l$1],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to avoid OPAL_LOG_COMMAND with explicitly compiling a source and then running ldd on it (e.g., ldd doesn't exist on OS X. I know libnl isn't on OS X, but we should probably still avoid OS-specific constructs in configure when possible).

It might be better to AC_LINK_IFELSE with symbols that are known to be in libnl vs. libnl3 to determine which libnl you have. That would seem to be more portable and resilient if the output of ldd varies, or the exact names of the libraries varies between different Linux distros.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got that, thanks !
is there any risk we might hit the similar issue melannox folks reported ?
(libmpi had to be explicitly linked with libopen-pal e.g. linking with libopen-rte only was not enough,
and btw/fwiw, I have never been able to reproduce this)
also, that is a bit slower since linking must be performed twice to reach a conclusion
(e.g. failure to link with libnl specific symbol, and then with libnl3 specific symbol is required to conclude libXYZ do nor require libnl nor libnl3)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, but I don't think it's likely -- this is a slightly different situation: we're linking in libraries which have already successfully been linked. If it happens, though, we can adjust.

As for twice as long for this individual test, yes, this is true, but I'm not worry about it. configure already takes a long time on a fast machine. Also, 2 links instead of 1 is still a very short amount of time.

@ggouaillardet
Copy link
Contributor Author

ok, will do

@@ -134,6 +134,8 @@ AC_DEFUN([_OPAL_CHECK_PACKAGE_LIB], [
test "$ac_cv_search_$3" != "none required"],
[$1_LIBS="$ac_cv_search_$3 $4"],
[$1_LIBS="$4"])
# Check libnl version if any
OPAL_CHECK_LIBNL_VERSION([$2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you also need to pass $4 here, too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and also $3? I.e., if you don't link in a symbol from the library in the LIBNL test, then it might not actually pull in the $2 library...

@ggouaillardet ggouaillardet force-pushed the poc/libnl_version_check branch 2 times, most recently from f50eb49 to 3b7cb18 Compare October 14, 2015 04:51
@ggouaillardet
Copy link
Contributor Author

@jsquyres i am stuck right now :-(

i was naively hoping the following program would link successfully

int (*ibv_open_device)(void);
int nl_geterror(void);

int main (int argc, char *argv[]) {
    ibv_open_device();
    nl_geterror();
    return 0;
}

with

gcc conftest.c -libverbs

unfortunatly, i get the following error

/usr/bin/ld: /tmp/cccPwM0X.o: undefined reference to symbol 'nl_geterror'
/usr/bin/ld: note: 'nl_geterror' is defined in DSO /lib64/libnl.so.1 so try adding it to the linker command line
/lib64/libnl.so.1: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status

note if i explicitly add -lnl, it links successfully

bottom line, i am back to the ldd option, which is known not to work at least on OSX

any thoughts ?

@jsquyres
Copy link
Member

@ggouaillardet Oy. Hmm. No immediate thoughts... but I'll think about this today. Perhaps I'll get a good idea by the time you get to work tomorrow...

@jsquyres
Copy link
Member

According to http://www.infradead.org/~tgr/libnl/, libnl is Linux-specific. So we should probably disable all this checking on non-Linux systems.

That being said, if we can't rely on libraries' implicit dependencies to resolve libnl symbols, then I, too, can't think of a better option than ldd. 😦 I'm worry about the fragility of output from ldd (e.g., if it changes over time/versions). Make your check be robust enough to handle if it can't understand the output from ldd, I guess.

@ggouaillardet
Copy link
Contributor Author

i returned to ldd
ldd test is pretty bulletproof ... it is a dumb grep with libnl.so or libnl-3.so.
if no match, then assume no libnl whatsoever is used.
this test does not assume any ldd output format (such as => )
if that looks good to you, i will squeeze this PR and commit it

@jsquyres
Copy link
Member

@ggouaillardet I'm trying to replicate the constructor segv you saw in #830. Can you tell me what versions of libnl and libnl3 you have installed?

@ggouaillardet
Copy link
Contributor Author

I will check that on Monday
This is a centos 7 box, and the crash occurs when running ompi_info --all

AC_DEFUN([OPAL_LIBNL_SANITY_INIT], [
opal_libnl_version=0
opal_libnl_libs=""
opal_libnl3_libs=""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: the "" isn't necessary here. Just foo=.

@lanl-ompi
Copy link
Contributor

Test FAILed.

2 similar comments
@lanl-ompi
Copy link
Contributor

Test FAILed.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@ggouaillardet ggouaillardet force-pushed the poc/libnl_version_check branch 2 times, most recently from f4a67d7 to 1f0b567 Compare October 19, 2015 04:16
@lanl-ompi
Copy link
Contributor

Test FAILed.

4 similar comments
@lanl-ompi
Copy link
Contributor

Test FAILed.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@ggouaillardet ggouaillardet force-pushed the poc/libnl_version_check branch from 1f0b567 to c33a58d Compare October 19, 2015 04:25
@lanl-ompi
Copy link
Contributor

Test FAILed.

@jsquyres jsquyres modified the milestones: v2.2.0, v2.x Oct 13, 2016
dnl
dnl $HEADER$
dnl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comments here about what the overall strategy is for the libnl checking? E.g., how the macros are intended to be used in this .m4 file, why we are doing these checks, ...etc. Just something for those who come after us to have a clue as to why we are going to all this trouble for libnl, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1449,6 +1449,20 @@ m4_ifdef([project_orte], [ORTE_CONFIG_FILES])
m4_ifdef([project_ompi], [OMPI_CONFIG_FILES])
m4_ifdef([project_oshmem], [OSHMEM_CONFIG_FILES])

OPAL_UNIQ([opal_libnlv1_libs])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here -- can you put some comments explaining what we're doing (even if it's mainly a reference to the main explanation in the .m4 file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i moved the macro and added some comment

dnl $1: variable name
dnl $2: --with-<foo> value
dnl
AC_DEFUN([OPAL_LIBNL_PARSE_WITH],[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what the purpose of this macro is -- did we use to call this for both libnl and libnl3 (i.e., that's why there's a substitution on $1)?

I ask because I only see this macro being called for libnl now, which makes me wonder why this is still a macro with $1 and $2 substitution...? Indeed, there's hard-coded places below where you check the value of $opal_want_libnl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i inlined this macro


AC_SUBST(opal_reachable_netlink_LIBNL_CPPFLAGS)
AC_SUBST(opal_reachable_netlink_LIBNL_LIBS)
[OPAL_REACHABLE_NETLINK_CHECK_LIBNL_xxx])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize why you used the _xxx suffix (to distinguish that it's not "libnl"), but it's still a bit weird...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i replaced the name with OPAL_REACHABLE_NETLINK_CHECK_LIBNL_Vx

dnl
dnl --------------------------------------------------------
AC_DEFUN([OPAL_REACHABLE_NETLINK_CHECK_LIBNL3],[
AC_DEFUN([OPAL_REACHABLE_NETLINK_CHECK_LIBNL_xxx],[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this macro be in config/opal_check_libnl.m4? It seems like we want each component that needs to link against libnl to have a configure.m4 that makes a trivial macro call to do all the libnl magic (e.g., similar to how opal/mca/btl/openib calls OPAL_CHECK_OPENFABRICS to do the bulk of its work).

default)
# Default case -- try and see if we can find it
opal_want_$1=default
opal_$1_location=/usr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting a -I or -L to /usr is almost never what you want to do. Is that what happens with this value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this value is never used this way. it is passed to OPAL_CHECK_PACKAGE in which /usr or /usr/local is kind of removed

@ggouaillardet ggouaillardet force-pushed the poc/libnl_version_check branch from 9bb7fe1 to 60755dd Compare October 14, 2016 04:16
dnl libnl-3.so.200 and friends, while a naive implementation of
dnl our configure logic would link libnl.so.1 to libdaplusnic,
dnl resulting in both versions in the dependency map at the same
dnl time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggouaillardet I think you might want to edit this comment (vs. blind copy-n-paste). E.g., the references here to DAPL aren't relevant. 😄

What I'm looking for here is both a description of the problem (which this first paragraph essentially does), and also at least some description of the overall strategy we're doing here in this M4 file to try to avoid the problem (e.g., after each OPAL_CHECK_PACKAGE, check to see if any of the dependent libraries include libnl/libnl3, detect conflicts, ...etc.).

default)
# Default case -- try and see if we can find it
opal_want_libnl=default
opal_libnl_location=/usr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this /usr is just going to be stripped away in OPAL_CHECK_PACKAGE, then we probably shouldn't even list it here (i.e., leave it blank), because every time someone looks at this, they'll say "Oh, you shouldn't -I/usr or -L/usr, and spend time tracking down just to figure out that it's both useless and harmless.

# libnl v1 and libnl3 are known *not* to coexist
# for each library, figure out whether it depends on libnl or libnl3 or none
# so conflicts can be reported and/or prevented
OPAL_LIBNL_SANITY_CHECK([$2], [$3], [$$1_LIBS])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does one of these $'s need to be m4-escaped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it is ok as it is

[AC_MSG_RESULT([$opal_libnlv3_libs])],
[AC_MSG_RESULT([(none)])])
AS_IF([test -n "$opal_libnlv1_libs" && test -n "$opal_libnlv3_libs"],
[AC_MSG_WARN([Both libnl v1 and libnl v3 are required])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of the word "required" here is a bit confusing -- a user could read that as "to build Open MPI, it's telling me that both libnl v1 and v3 are required, but then it also tells me that this is an error!"

Maybe s/are required/have been found as dependent libraries/ ...?

ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Oct 17, 2016
@ggouaillardet
Copy link
Contributor Author

@jsquyres i pushed an other commit with the requested changes (plus a bit of revamp)
if it works for you, i will squash, rebase and merge

@ggouaillardet
Copy link
Contributor Author

@hppritcha the URL sets by LANL-CI Jenkins is invalid, can you please have a look ?

@ibm-ompi
Copy link

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/dc4747f4f5fd8e7c6c05fa37dbf4c5d1

@ibm-ompi
Copy link

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/38549f772045a006e10d3ac771bbe9c7

@ibm-ompi
Copy link

Build Failed with PGI compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/642ced56860259e02256422416b925e3

@jsquyres
Copy link
Member

bot:retest

@ibm-ompi
Copy link

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/6af01971e176ba666d0c8f8022d1a495

@ibm-ompi
Copy link

Build Failed with PGI compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/236a41a0cec1ff3049103eaac39a9c80

@jsquyres
Copy link
Member

@ggouaillardet The errors that are coming back seem legit.

@ibm-ompi
Copy link

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/ba0898d40197da70870ab3c6be59a10a

@ggouaillardet
Copy link
Contributor Author

@jsquyres i will fix that. that seems to happen on systems without any libnl devel packages

@ggouaillardet
Copy link
Contributor Author

@jsquyres i fixed that and updated the PR

@jsquyres
Copy link
Member

@ggouaillardet This looks good to me now; thanks!

Can you squash it down to one commit? I think we'll then be good to go.

libnl and libnl-3 are known to conflict with each other, so detect
and abort if these two libs are both used directly (e.g. Open MPI
uses libnl-3) or indirectly (e.g. libibverbs.so might depend on libnl)
@ggouaillardet ggouaillardet force-pushed the poc/libnl_version_check branch from c9240d8 to f2a80dc Compare October 25, 2016 00:24
@ggouaillardet
Copy link
Contributor Author

Done !

@jsquyres jsquyres merged commit 582f290 into open-mpi:master Oct 25, 2016
@jsquyres
Copy link
Member

@ggouaillardet Let's let this soak on master for a bit. Given that v2.1.0 was deleted, there may even still be time to make it into v2.1.0. If not, v2.2.0 sounds like a fine target.

@ggouaillardet
Copy link
Contributor Author

From a pragmatic point of view, the issue was only reported with mca/reachable, and it is not in v2.x
So there is no rush to PR this

@jsquyres
Copy link
Member

FWIW: we've seen this issue between the usnic BTL and the openib BTL

  • usnic: libfabric -> libnl
  • openib: libverbs -> librdmacm -> libnl

I agree: this is not urgent. But it would be nice to get into the v2.x train at some point.

@jsquyres jsquyres modified the milestones: v2.1.0, v2.2.0 Feb 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants