-
Notifications
You must be signed in to change notification settings - Fork 912
configury: check libnl version and warn in case of conflict #1014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
configury: check libnl version and warn in case of conflict #1014
Conversation
return 0; | ||
} | ||
EOF | ||
OPAL_LOG_COMMAND([$CC -o conftest $CFLAGS $CPPFLAGS conftest_c.$ac_ext $LDFLAGS -l$1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to avoid OPAL_LOG_COMMAND with explicitly compiling a source and then running ldd on it (e.g., ldd doesn't exist on OS X. I know libnl isn't on OS X, but we should probably still avoid OS-specific constructs in configure when possible).
It might be better to AC_LINK_IFELSE with symbols that are known to be in libnl vs. libnl3 to determine which libnl you have. That would seem to be more portable and resilient if the output of ldd varies, or the exact names of the libraries varies between different Linux distros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got that, thanks !
is there any risk we might hit the similar issue melannox folks reported ?
(libmpi had to be explicitly linked with libopen-pal e.g. linking with libopen-rte only was not enough,
and btw/fwiw, I have never been able to reproduce this)
also, that is a bit slower since linking must be performed twice to reach a conclusion
(e.g. failure to link with libnl specific symbol, and then with libnl3 specific symbol is required to conclude libXYZ do nor require libnl nor libnl3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible, but I don't think it's likely -- this is a slightly different situation: we're linking in libraries which have already successfully been linked. If it happens, though, we can adjust.
As for twice as long for this individual test, yes, this is true, but I'm not worry about it. configure already takes a long time on a fast machine. Also, 2 links instead of 1 is still a very short amount of time.
ok, will do |
@@ -134,6 +134,8 @@ AC_DEFUN([_OPAL_CHECK_PACKAGE_LIB], [ | |||
test "$ac_cv_search_$3" != "none required"], | |||
[$1_LIBS="$ac_cv_search_$3 $4"], | |||
[$1_LIBS="$4"]) | |||
# Check libnl version if any | |||
OPAL_CHECK_LIBNL_VERSION([$2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you also need to pass $4 here, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and also $3? I.e., if you don't link in a symbol from the library in the LIBNL test, then it might not actually pull in the $2 library...
f50eb49
to
3b7cb18
Compare
@jsquyres i am stuck right now :-( i was naively hoping the following program would link successfully int (*ibv_open_device)(void);
int nl_geterror(void);
int main (int argc, char *argv[]) {
ibv_open_device();
nl_geterror();
return 0;
} with
unfortunatly, i get the following error
note if i explicitly add bottom line, i am back to the ldd option, which is known not to work at least on OSX any thoughts ? |
@ggouaillardet Oy. Hmm. No immediate thoughts... but I'll think about this today. Perhaps I'll get a good idea by the time you get to work tomorrow... |
According to http://www.infradead.org/~tgr/libnl/, libnl is Linux-specific. So we should probably disable all this checking on non-Linux systems. That being said, if we can't rely on libraries' implicit dependencies to resolve libnl symbols, then I, too, can't think of a better option than ldd. 😦 I'm worry about the fragility of output from ldd (e.g., if it changes over time/versions). Make your check be robust enough to handle if it can't understand the output from ldd, I guess. |
i returned to ldd |
@ggouaillardet I'm trying to replicate the constructor segv you saw in #830. Can you tell me what versions of libnl and libnl3 you have installed? |
I will check that on Monday |
AC_DEFUN([OPAL_LIBNL_SANITY_INIT], [ | ||
opal_libnl_version=0 | ||
opal_libnl_libs="" | ||
opal_libnl3_libs="" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: the "" isn't necessary here. Just foo=
.
Test FAILed. |
2 similar comments
Test FAILed. |
Test FAILed. |
f4a67d7
to
1f0b567
Compare
Test FAILed. |
4 similar comments
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
1f0b567
to
c33a58d
Compare
Test FAILed. |
dnl | ||
dnl $HEADER$ | ||
dnl | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comments here about what the overall strategy is for the libnl checking? E.g., how the macros are intended to be used in this .m4 file, why we are doing these checks, ...etc. Just something for those who come after us to have a clue as to why we are going to all this trouble for libnl, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -1449,6 +1449,20 @@ m4_ifdef([project_orte], [ORTE_CONFIG_FILES]) | |||
m4_ifdef([project_ompi], [OMPI_CONFIG_FILES]) | |||
m4_ifdef([project_oshmem], [OSHMEM_CONFIG_FILES]) | |||
|
|||
OPAL_UNIQ([opal_libnlv1_libs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here -- can you put some comments explaining what we're doing (even if it's mainly a reference to the main explanation in the .m4 file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i moved the macro and added some comment
dnl $1: variable name | ||
dnl $2: --with-<foo> value | ||
dnl | ||
AC_DEFUN([OPAL_LIBNL_PARSE_WITH],[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what the purpose of this macro is -- did we use to call this for both libnl
and libnl3
(i.e., that's why there's a substitution on $1
)?
I ask because I only see this macro being called for libnl
now, which makes me wonder why this is still a macro with $1
and $2
substitution...? Indeed, there's hard-coded places below where you check the value of $opal_want_libnl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i inlined this macro
|
||
AC_SUBST(opal_reachable_netlink_LIBNL_CPPFLAGS) | ||
AC_SUBST(opal_reachable_netlink_LIBNL_LIBS) | ||
[OPAL_REACHABLE_NETLINK_CHECK_LIBNL_xxx]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize why you used the _xxx
suffix (to distinguish that it's not "libnl"), but it's still a bit weird...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i replaced the name with OPAL_REACHABLE_NETLINK_CHECK_LIBNL_Vx
dnl | ||
dnl -------------------------------------------------------- | ||
AC_DEFUN([OPAL_REACHABLE_NETLINK_CHECK_LIBNL3],[ | ||
AC_DEFUN([OPAL_REACHABLE_NETLINK_CHECK_LIBNL_xxx],[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this macro be in config/opal_check_libnl.m4
? It seems like we want each component that needs to link against libnl to have a configure.m4
that makes a trivial macro call to do all the libnl magic (e.g., similar to how opal/mca/btl/openib
calls OPAL_CHECK_OPENFABRICS
to do the bulk of its work).
default) | ||
# Default case -- try and see if we can find it | ||
opal_want_$1=default | ||
opal_$1_location=/usr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting a -I
or -L
to /usr
is almost never what you want to do. Is that what happens with this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this value is never used this way. it is passed to OPAL_CHECK_PACKAGE
in which /usr
or /usr/local
is kind of removed
9bb7fe1
to
60755dd
Compare
dnl libnl-3.so.200 and friends, while a naive implementation of | ||
dnl our configure logic would link libnl.so.1 to libdaplusnic, | ||
dnl resulting in both versions in the dependency map at the same | ||
dnl time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggouaillardet I think you might want to edit this comment (vs. blind copy-n-paste). E.g., the references here to DAPL aren't relevant. 😄
What I'm looking for here is both a description of the problem (which this first paragraph essentially does), and also at least some description of the overall strategy we're doing here in this M4 file to try to avoid the problem (e.g., after each OPAL_CHECK_PACKAGE, check to see if any of the dependent libraries include libnl/libnl3, detect conflicts, ...etc.).
default) | ||
# Default case -- try and see if we can find it | ||
opal_want_libnl=default | ||
opal_libnl_location=/usr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this /usr is just going to be stripped away in OPAL_CHECK_PACKAGE, then we probably shouldn't even list it here (i.e., leave it blank), because every time someone looks at this, they'll say "Oh, you shouldn't -I/usr
or -L/usr
, and spend time tracking down just to figure out that it's both useless and harmless.
# libnl v1 and libnl3 are known *not* to coexist | ||
# for each library, figure out whether it depends on libnl or libnl3 or none | ||
# so conflicts can be reported and/or prevented | ||
OPAL_LIBNL_SANITY_CHECK([$2], [$3], [$$1_LIBS]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does one of these $
's need to be m4-escaped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it is ok as it is
[AC_MSG_RESULT([$opal_libnlv3_libs])], | ||
[AC_MSG_RESULT([(none)])]) | ||
AS_IF([test -n "$opal_libnlv1_libs" && test -n "$opal_libnlv3_libs"], | ||
[AC_MSG_WARN([Both libnl v1 and libnl v3 are required]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of the word "required" here is a bit confusing -- a user could read that as "to build Open MPI, it's telling me that both libnl v1 and v3 are required, but then it also tells me that this is an error!"
Maybe s/are required/have been found as dependent libraries/ ...?
@jsquyres i pushed an other commit with the requested changes (plus a bit of revamp) |
@hppritcha the URL sets by |
Build Failed with GNU compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/dc4747f4f5fd8e7c6c05fa37dbf4c5d1 |
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/38549f772045a006e10d3ac771bbe9c7 |
Build Failed with PGI compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/642ced56860259e02256422416b925e3 |
bot:retest |
Build Failed with GNU compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/6af01971e176ba666d0c8f8022d1a495 |
Build Failed with PGI compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/236a41a0cec1ff3049103eaac39a9c80 |
@ggouaillardet The errors that are coming back seem legit. |
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/ba0898d40197da70870ab3c6be59a10a |
@jsquyres i will fix that. that seems to happen on systems without any libnl devel packages |
@jsquyres i fixed that and updated the PR |
@ggouaillardet This looks good to me now; thanks! Can you squash it down to one commit? I think we'll then be good to go. |
libnl and libnl-3 are known to conflict with each other, so detect and abort if these two libs are both used directly (e.g. Open MPI uses libnl-3) or indirectly (e.g. libibverbs.so might depend on libnl)
c9240d8
to
f2a80dc
Compare
Done ! |
@ggouaillardet Let's let this soak on master for a bit. Given that v2.1.0 was deleted, there may even still be time to make it into v2.1.0. If not, v2.2.0 sounds like a fine target. |
From a pragmatic point of view, the issue was only reported with mca/reachable, and it is not in v2.x |
FWIW: we've seen this issue between the usnic BTL and the openib BTL
I agree: this is not urgent. But it would be nice to get into the v2.x train at some point. |
libnl and libnl-3 are known to conflict with each other, so warn
if these two libs are used by different third party libs