misc BSD related fixes #2110

Closed

ggouaillardet
Contributor

No description provided.

@@ -27,6 +27,9 @@
#ifdef HAVE_STRING_H
#include <string.h>
#endif
#ifdef HAVE_SYS_STAT_H
Contributor

Please bring that downstream to pmix

Contributor Author

that was my plan :-)

Member

@PHHargrove PHHargrove left a comment

I have git and recent autotools on OpenBSD-6.0.
After checking out the PR branch, I am now able to run autogen, which I could not do before.

Member

@PHHargrove PHHargrove left a comment

Sorry... having trouble w/ the new review GUI.
My comments were incomplete.

After reporting the net/if.h problem I discovered the divert_pop issue and the ethtool.c header issues. All three seem to be fixed.

I do not know what the pmix change was intended to fix.

I don't have a fortran compiler, and so cannot comment on the quoting of '*' in the fortran configury.

@PHHargrove
Member

I still cannot actually run anything on OpenBSD-6.0:

{openbsd6-amd64 examples}$ mpirun  -H localhost -np 2 ./hello_c
[openbsd6-amd64.my.domain:33260] [[32737,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../ompi/orte/mca/ess/hnp/ess_hnp_module.c at line 616
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

@ggouaillardet
Contributor Author

@PHHargrove what is your Open MPI configure command line ?

i am able to build with --disable-dlopen but it crashes somewhere in MPI_Finalize when i mpirun

@PHHargrove
Member

@ggouaillardet my only configure option was --prefix=...

@ggouaillardet
Contributor Author

i found something odd with OpenBSD m4

$10 is interpreted as $1 followed by the 0 character.
we expect it to be interpreted as the tenth argument of the macro, but that turns out to be a GNU extension.

m4 is invoked with the -g option, but that does not change anything.

the following program can be used to demonstrate the issue

define(`print_first_and_tenth', `echo $1 $10')

print_first_and_tenth(`one', `two', `three', `four', `five', `six', `seven', `eight', `nine', `ten')

bottom line, autogen.pl should not be used on OpenBSD 6.0 :-(
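
The difference described above can be modelled in a few lines. This is only a toy Python sketch of the two parsing rules, not how any m4 is implemented; the `expand` function and its `gnu` flag are hypothetical names introduced for illustration.

```python
import re

def expand(body, args, gnu=False):
    """Substitute $N references in a macro body (toy model, not real m4)."""
    # Single-digit mode reads exactly one digit after '$';
    # GNU-style mode reads every digit that follows.
    pattern = r"\$(\d+)" if gnu else r"\$(\d)"

    def substitute(match):
        index = int(match.group(1))
        # $0 is left alone here; missing arguments expand to "".
        if index == 0:
            return match.group(0)
        return args[index - 1] if index <= len(args) else ""

    return re.sub(pattern, substitute, body)

args = ["one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine", "ten"]

print(expand("echo $1 $10", args, gnu=True))   # echo one ten
print(expand("echo $1 $10", args, gnu=False))  # echo one one0
```

In single-digit mode the `0` after `$1` is ordinary text, which is why the first argument appears twice in the second line.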

@PHHargrove
Member

@ggouaillardet

GNU autotools and OpenBSD's m4 don't get along.
I always install gnu m4 and set M4=gm4 in my environment.
Then autogen.pl works fine for me.

@PHHargrove
Member

PHHargrove commented Sep 23, 2016

@ggouaillardet

I also see a problem within Finalize, now that I've configured with --disable-dlopen.
What I see is a call to pthread_mutex_lock() with a pointer-to-lock value of 0xdfdfdfdfdfdfdfdf.
That pattern indicates use-after-free, because man malloc.conf says that 0xdf is the value used to overwrite memory when it is freed.
If I set MALLOC_OPTIONS=j in my environment to turn off writing of this pattern, then the error vanishes.

(gdb) where
#0  _rthread_mutex_lock (mutexp=0xdfdfdfdfdfdfdfdf, trywait=0, abstime=0x0)
    at /usr/src/lib/librthread/rthread_sync.c:100
#1  0x000006154e62b946 in opal_libevent2022_event_del (ev=0x6154eb1afa0)
    at ../../../../../../ompi/opal/mca/event/libevent2022/libevent/event.c:2209
#2  0x000006154e6c406e in pdes () from /home/phargrov/OMPI/PR2110/INST/lib/libopen-pal.so.0.0
#3  0x000006154e685b6a in PMIx_Finalize () from /home/phargrov/OMPI/PR2110/INST/lib/libopen-pal.so.0.0
#4  0x000006154e672c3a in pmix3x_client_finalize ()
   from /home/phargrov/OMPI/PR2110/INST/lib/libopen-pal.so.0.0
#5  0x000006162dbb9229 in rte_finalize () from /home/phargrov/OMPI/PR2110/INST/lib/libopen-rte.so.0.0
#6  0x000006162db65078 in orte_finalize () from /home/phargrov/OMPI/PR2110/INST/lib/libopen-rte.so.0.0
#7  0x00000615be733511 in ompi_mpi_finalize () from /home/phargrov/OMPI/PR2110/INST/lib/libmpi.so.0.0
#8  0x000006133bc00d1d in main (argc=1, argv=0x7f7fffff42f8) at hello_c.c:24
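
The junk-pattern reasoning above can be checked mechanically. The sketch below is a small Python helper, not part of any debugger or of Open MPI; the name `looks_junk_filled` and its parameters are made up for illustration, and the 0xdf default follows the malloc.conf behaviour quoted in the comment.

```python
def looks_junk_filled(pointer, fill_byte=0xDF, width=8):
    """Return True if every byte of a pointer-sized value equals fill_byte."""
    return pointer.to_bytes(width, "little") == bytes([fill_byte]) * width

print(looks_junk_filled(0xDFDFDFDFDFDFDFDF))  # mutexp from frame #0: True
print(looks_junk_filled(0x6154EB1AFA0))       # ev from frame #1: False
```

A pointer that matches was most likely read out of memory that had already been freed and overwritten.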

@ggouaillardet
Contributor Author

@rhc54 can you please comment on the following

event_base_free is first invoked here and frees base->th_base_lock

bt1

and then event_del uses base->th_base_lock (after it was free'd earlier) here

bt2

the simple patch below fixes that, can you please double check it ?

diff --git a/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c b/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c
index 4d19308..cfa0421 100644
--- a/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c
+++ b/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c
@@ -506,6 +506,7 @@ PMIX_EXPORT pmix_status_t PMIx_Finalize(const pmix_info_t info[], size_t ninfo)
                              "pmix:client finalize sync received");
      }

+     PMIX_DESTRUCT(&pmix_client_globals.myserver);
      if (!pmix_globals.external_evbase) {
         #ifdef HAVE_LIBEVENT_GLOBAL_SHUTDOWN
             libevent_global_shutdown();
@@ -514,7 +515,6 @@ PMIX_EXPORT pmix_status_t PMIx_Finalize(const pmix_info_t info[], size_t ninfo)
      }

      pmix_usock_finalize();
-     PMIX_DESTRUCT(&pmix_client_globals.myserver);
      PMIX_LIST_DESTRUCT(&pmix_client_globals.pending_requests);

      if (0 <= pmix_client_globals.myserver.sd) {
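
Why the ordering in this patch matters can be shown with a toy model. The Python sketch below does not reproduce PMIx or libevent code; `EventBase`, `Event`, `finalize` and the `JUNK` marker are hypothetical stand-ins, assuming only what the comment states: freeing the base frees its lock, and deleting an event needs that lock.

```python
JUNK = "0xdf-filled"  # stands in for memory overwritten by free()

class EventBase:
    """Toy stand-in for an event base that owns a lock."""
    def __init__(self):
        self.lock = "th_base_lock"

    def free(self):
        # Model the allocator scribbling over freed memory.
        self.lock = JUNK

class Event:
    """Toy stand-in for an event registered on a base."""
    def __init__(self, base):
        self.base = base

    def delete(self):
        # Deleting an event has to take the base lock first.
        if self.base.lock == JUNK:
            raise RuntimeError("use-after-free: base lock already freed")

def finalize(destruct_server_first):
    base = EventBase()
    server_event = Event(base)   # hypothetical event owned by "myserver"
    if destruct_server_first:
        server_event.delete()    # patched order: destruct, then free the base
        base.free()
    else:
        base.free()              # original order: free the base, then destruct
        server_event.delete()
    return "ok"

print(finalize(destruct_server_first=True))
try:
    finalize(destruct_server_first=False)
except RuntimeError as err:
    print(err)
```

Only the second call fails, which mirrors moving PMIX_DESTRUCT ahead of the event base teardown.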

@ggouaillardet
Contributor Author

@jsquyres can you please review ?
The commit that escapes * in fortran is optional since

  • this is only for non GNU m4
  • non GNU m4 cannot be used for autogen.pl anyway, a blocker is $10 that is not interpreted as we expect
  • on Linux, the generated configure is identical with or without this commit, so it does not hurt

@rhc54
Contributor

rhc54 commented Sep 23, 2016

The PMIx change looks good to me - please upstream as well. Thx!

@ggouaillardet
Contributor Author

@rhc54 i made openpmix/openpmix#155 but only for v2.x

master does things differently, and the fix might not be needed.

@rhc54
Contributor

rhc54 commented Sep 23, 2016

Thanks! I'll take a look at master and see

autogen.pl Outdated
verbose "Checking whether m4 supports the gnu extensions\n";
my $m4 = $ENV{'M4'};
$m4 = "m4" unless (defined $m4);
my $result = `$m4 -g config/check_m4_gnu_extension.m4`;
Member

Don't you need to direct stderr to /dev/null (or better yet, config.log, or even better yet, use OPAL_LOG_COMMAND -- see opal_functions.m4)? Otherwise, if something goes wrong, we'll see spurious output in configure stdout / you can't see what happened by examining config.log.

Contributor Author

this is autogen.pl, OPAL macros cannot be used, and there is no config.log at this stage.

Member

Wow, I'm a dope -- you're totally right.

autogen.pl Outdated
$m4 = "m4" unless (defined $m4);
my $result = `$m4 -g config/check_m4_gnu_extension.m4`;
if ($? != 0) {
$result = `$m4 config/check_m4_gnu_extension.m4`;
Member

I think you need to check the exit status of this command, too (e.g., use OPAL_LOG_COMMAND here, too).

Contributor Author

the rationale here is that m4 -g fails on OSX, so we have to try again with plain m4.
if the second command fails, then $result will not contain the expected value, so we do not really care about the exit code.

Member

My point here was that we should check here if m4 fails, too. The first time you call m4, you can expect it to fail for the "on the wrong OS" reason. But m4 could fail for some unexpected reason, too. And that m4 failure output may not go to stdout, making troubleshooting quite difficult. So it would be good to check here and fail nicely if m4 fails for an unexpected reason.
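
The two-step check being discussed (tolerate a failing `-g` attempt, but fail clearly if the plain retry also fails) could look roughly like this. autogen.pl itself is Perl, so this is only a Python sketch of the control flow; `check_m4`, the injected `run` callable and `fake_osx_m4` are hypothetical names, and the probe path is the one shown in the snippet under review.

```python
def check_m4(run, m4="m4", probe="config/check_m4_gnu_extension.m4"):
    """Return the probe output, or raise with a readable message.

    `run(argv)` is a hypothetical callable returning (exit_status, stdout).
    """
    status, output = run([m4, "-g", probe])
    if status != 0:
        # Expected where m4 rejects -g: retry without the flag.
        status, output = run([m4, probe])
    if status != 0:
        # Unexpected: report the failure instead of comparing garbage.
        raise RuntimeError(f"{m4} exited with status {status}")
    result = output.strip()
    if result != "ten":
        raise RuntimeError(f"{m4} returned {result!r}, expected 'ten'")
    return result

def fake_osx_m4(argv):
    # Pretend "-g" is rejected but the plain invocation works.
    return (1, "") if "-g" in argv else (0, "ten\n")

print(check_m4(fake_osx_m4))  # ten
```

Passing the runner in as a parameter keeps the sketch runnable without m4 installed; a real runner would wrap something like `subprocess.run` and return its exit code and captured stdout.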

autogen.pl Outdated
\"$result\", expected value is \"ten\", and known unexpected value is \"one0\"

=================================================================\n";
my_exit(1);
Member

I don't know if you really need two different error messages here -- if you care, you could combine them into a single error message (I don't feel strongly; it is just slightly more code to support).

Contributor Author

i reworked that part and used a single error message
(and chomp too ...)

@jsquyres jsquyres removed this from the v2.1.0 milestone Sep 26, 2016
@ggouaillardet ggouaillardet force-pushed the topic/openbsd6_fixes branch 2 times, most recently from 877fa5f to 5eb6c54 Compare September 27, 2016 02:42
@ggouaillardet
Contributor Author

:bot:mellanox:retest

@artpol84
Contributor

:bot:mellanox:retest

autogen.pl Outdated
if you are running on OpenBSD, you might want to
export M4=gm4
and run autogen.pl again
=================================================================\n";
Member

I suggest this:

I need an m4 that supports GNU extensions; sorry!  I am gonna abort.

    $cmd

returned \"$result\", but I need it to return \"ten\".  If you are 
running on OpenBSD (e.g., if the result is \"one0\"), you might want to

export M4=gm4

and try running autogen.pl again.

(word wrap that above paragraph appropriately; it's hard to tell ~72 chars here in the github web UI)

@ggouaillardet
Contributor Author

i made the requested changes

Member

@jsquyres jsquyres left a comment

Please add a Signed-off-by line to this PR's commit.

@bwbarrett
Member

@ggouaillardet, do you still want to do something with this patch?

@ggouaillardet
Contributor Author

@bwbarrett i think this should land into master.
right now, the FreeBSD box is failing the test (because it is not using GNU m4, and $10 is interpreted as $1 followed by the 0 character).
can you please update the FreeBSD box, install gm4 and export M4=gm4 before invoking autogen.pl ?

@ggouaillardet ggouaillardet changed the title misc OpenBSD 6.0 related fixes misc BSD related fixes Sep 15, 2017
@jsquyres
Member

@bwbarrett Yeah, it looks like this PR is intentionally causing the FreeBSD CI build to fail, because it's not doing what we expect with $10 in m4 code. @ggouaillardet's suggestion of using GNU m4 and exporting M4=gm4 might be Good Enough.

@bwbarrett
Member

Is there no other solution than to require GNU m4? That's going to be a major pain in the butt to support long term, and I'm not convinced it is the right thing. It seems much more friendly to require only the bits the base system provides. I'm pretty strongly against this patch until we show why we need GNU extensions as the only solution.

@jsquyres
Member

FWIW, I see at least a few places where we use $10 in the config dir:

$ ack '\$10'
opal_check_package.m4
211:          [$10])

opal_mca.m4
611:    AS_IF([test "$should_build" = "1"], [$9], [$10])

opal_setup_component_package.m4
92:          [$10],

I haven't looked to see if a) we can use less than 10 params in those places, or b) if there's a way non-GNU-m4 implementations accept >=10 parameters.
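
The same search can be narrowed to references with two or more digits, which are the only ones affected. The Python sketch below is an illustration, not project tooling; `find_multi_digit_args` is a hypothetical helper, and the sample line is the `opal_mca.m4` hit listed above.

```python
import re

# A '$' followed by at least two digits, e.g. $10 or $11.
MULTI_DIGIT_ARG = re.compile(r"\$\d{2,}")

def find_multi_digit_args(text):
    """Return (line_number, reference) pairs for $10, $11, ... in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in MULTI_DIGIT_ARG.finditer(line):
            hits.append((number, match.group(0)))
    return hits

sample = 'AS_IF([test "$should_build" = "1"], [$9], [$10])\n'
print(find_multi_digit_args(sample))  # [(1, '$10')]
```

Shell variables such as `$should_build` and single-digit references such as `$9` are not reported.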

@ggouaillardet
Contributor Author

The issue is with autogen.pl
So technically, we can build Open MPI on BSD from a tarball (generated with GNU autotools) but not from git. I am pretty sure this is not jenkins friendly (mtt can be just fine with that)
Keep in mind that without GNU autotools, Open MPI will likely build, but the build is very likely incorrect, and we had to find it the hard way by debugging a runtime issue that made zero sense at first.

@PHHargrove
Member

@jsquyres wrote:

I haven't looked to see if a) we can use less than 10 params in those places, or b) if there's a way non-GNU-m4 implementations accept >=10 parameters.

With respect to (b) the answer is YES.
But first some important background...

Here is some related text from the POSIX spec for m4:

Arguments are positionally defined and referenced. The string "$1" in the defining text shall be replaced by the first argument. Systems shall support at least nine arguments; only the first nine can be referenced, using the strings "$1" to "$9", inclusive.

More importantly, here is some text from the GNU m4 docs:

GNU m4 allows the number following the ‘$’ to consist of one or more digits, allowing macros to have any number of arguments. The extension of accepting multiple digits is incompatible with POSIX, and is different than traditional implementations of m4, which only recognize one digit. Therefore, future versions of GNU M4 will phase out this feature. To portably access beyond the ninth argument, you can use the argn macro documented later (see Shift).

The last two sentences of that quote:

  1. Warn that GNU m4 will phase out multi-digit references such as $10 in a future version
  2. Reference a solution to the problem of accessing args past the 9th.

Here is the argn macro given by the "(see Shift)" I quoted above:

— Composite: argn (n, ...)
Expands to argument n out of the remaining arguments. n must be a positive number. Usually invoked as ‘argn(`n',$@)’.

It is implemented as:

     define(`argn', `ifelse(`$1', 1, ``$2'',
       `argn(decr(`$1'), shift(shift($@)))')')

The complete page for "Shift", with the argn macro appearing at the very end:
https://www.gnu.org/software/m4/manual/m4-1.4.14/html_node/Shift.html
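
For readers less familiar with m4, the recursion in the quoted `argn` definition can be mirrored in Python. This is a model of the logic only and does not run any m4; the range check is an addition that the m4 definition above does not have.

```python
def argn(n, *args):
    """Return the n-th (1-based) of args, recursing like the m4 definition."""
    if n < 1 or n > len(args):
        # Added for the sketch; not part of the quoted m4 macro.
        raise IndexError("n must be between 1 and the number of arguments")
    if n == 1:
        return args[0]
    # decr($1) becomes n - 1; shift(shift($@)) drops n and the first argument.
    return argn(n - 1, *args[1:])

words = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
print(argn(10, *words))  # ten
```

Each step discards one leading argument and decrements the index, so the tenth argument is reached without ever writing a multi-digit reference.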

@jsquyres
Member

@ggouaillardet Can you take a run at trying argn and/or its friends?

otherwise autogen.pl fails on OpenBSD 6.0

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
otherwise autogen.pl fails on OpenBSD 6.0

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
$10 is interpreted as the tenth argument by a M4 GNU extension,
and is interpreted as the first argument followed by the '0' character otherwise
(a notable example are BSD systems).

Since we do not really need more than nine parameters per macro, revamp
a bit of configury and make non GNU M4 happy pandas.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet
Contributor Author

i could not get argn to work (very likely because i am a M4 newbie)

anyway, we do not need more than 9 arguments per macro, so i revamped the code to make all M4 happy pandas.

@PHHargrove
Member

@ggouaillardet
Regarding argn it is my understanding that GNU m4 provides this macro, but others do not.
However, the 2-line implementation of that macro in the GNU m4 docs is portable.

Of course reducing all macros to no more than 9 args should work too!

# [eval if should build],
# [eval if should not build])
# [set to 1 if should build],
# [set to 0 if should not build])
Member

This isn't quite right -- the comment implies that there are 10 parameters. Could you clarify?

It might also be worth mentioning in a comment for each of the macros why there are 9 params instead of 10 (because it's a departure from the m4/Open MPI convention of "eval this if true, eval that if false").

Contributor Author

Comment is dead wrong, will fix

dnl -*- shell-script -*-
dnl
dnl Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
dnl University Research and Technology
Member

This file looks like it was removed because it is unused. Yay!

But it should probably be a separate commit, just to show that it was removed because it's dead code -- not part of the revamp to go from 10 m4 params to 9 m4 params.

Contributor Author

Sure, will do

@@ -208,7 +205,7 @@ AC_DEFUN([OPAL_CHECK_PACKAGE],[
[opal_check_package_happy="yes"],
[opal_check_package_happy="no"])],
[opal_check_package_happy="no"],
-[$10])
+[])
Member

Did you check that in all places that we call OPAL_CHECK_PACKAGE from that we don't need any extra includes?

Contributor Author

Yep, i did check that.

Member

Cool. It is safe to remove the includes param from OPAL_CHECK_PACKAGE_HEADER(), then? Or do other places call it that need includes?

@bwbarrett
Member

I'm not sure we're going down the right path; give me some time to look at the argn macro before we give up on that path; should be later this week.

@hppritcha
Member

@jsquyres should we close this?

@bwbarrett bwbarrett self-assigned this Mar 21, 2018
@rhc54
Contributor

rhc54 commented Aug 28, 2018

Given how out-of-date this is and the lack of any movement in six months, let's just drop it.

@rhc54 rhc54 closed this Aug 28, 2018
@jsquyres
Member

Yo @bwbarrett -- your last comment was:

I'm not sure we're going down the right path; give me some time to look at the argn macro before we give up on that path; should be later this week.

Do you still care about m4 argn support for when we have >=10 m4 parameters?
