@hjelmn
Member

@hjelmn hjelmn commented Jun 28, 2016

This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.

Thanks to @artpol84 for tracking this down.

Fixes #1823

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

@jsquyres
Member

v2.0.0 or v2.0.1?

@hjelmn
Member Author

hjelmn commented Jun 28, 2016

2.0.0. No reason to hold this one back.

@hjelmn
Member Author

hjelmn commented Jun 28, 2016

This should put an end to the mlx Jenkins failures. This is a long-standing bug, but it seems to have been triggered by a hardware change on the mlx Jenkins systems.

@jsquyres
Member

@hppritcha I am good with 2.0.0 for this. You?

@jsquyres
Member

I've seen this false failure on the Mellanox Jenkins before -- not sure why it happens:

make[2]: *** No rule to make target `../config/depcomp', needed by `distdir'.  Stop.

A bot:retest seems to fix it...

@jladd-mlnx
Member

I'm seeing this when trying to run autogen.pl on the source on the Jenkins server:

$ ./autogen.pl
autogen.pl has been invoked in the source tree of an Open MPI distribution tarball; aborting...
You likely do not need to invoke "autogen.pl" -- you can probably run "configure" directly.
If you really know what you are doing, and really need to run autogen.pl, use the "--force" flag. at ./autogen.pl line 98

@jsquyres
Member

autogen.pl checks the VERSION file to see if repo_rev is empty.

  • If it's empty, then it's a git clone, and we're good to go.
  • If it's not empty, then you're building from a tarball, and it issues that warning.

Is VERSION's repo_rev somehow not empty for you?
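
For anyone wanting to check this quickly from the shell, here is a minimal sketch. It assumes VERSION contains a single line of the form repo_rev=<value>; the helper names repo_rev_value and tree_kind are hypothetical and are not part of autogen.pl, which does its own check in Perl.

```shell
# Hypothetical helper: print whatever follows "repo_rev=" in a VERSION file.
# Comment lines (starting with '#') do not match and are ignored.
repo_rev_value() {
    sed -n 's/^repo_rev=//p' "$1"
}

# Hypothetical helper: classify the tree following the rule described above.
# An empty repo_rev means a git clone; anything else looks like a tarball.
tree_kind() {
    if [ -z "$(repo_rev_value "$1")" ]; then
        echo "git clone"
    else
        echo "tarball"
    fi
}
```

Running tree_kind VERSION at the top of the tree prints "git clone" when repo_rev is empty and "tarball" otherwise.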

@hjelmn
Member Author

hjelmn commented Jun 28, 2016

:bot:retest:

@hjelmn
Member Author

hjelmn commented Jun 28, 2016

:bot:retest

@jladd-mlnx
Member

@jsquyres Oddly enough, yes:

# If repo_rev is empty, then the repository version number will be
# obtained during "make dist" via the "git describe --tags --always"
# command, or with the date (if "git describe" fails) in the form of
# "date<date>".

repo_rev=dev-4337-g806b0d7

@jsquyres
Member

@jladd-mlnx Yeah, that's odd. Huh. Any idea where that's coming from? It's not set that way in the repo:

$ hub checkout https://github.com/open-mpi/ompi/pull/1824
Updating hjelmn
remote: Counting objects: 21388, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 21388 (delta 7496), reused 7495 (delta 7495), pack-reused 13885
Receiving objects: 100% (21388/21388), 7.84 MiB | 2.74 MiB/s, done.
Resolving deltas: 100% (17354/17354), completed with 2939 local objects.
From git://github.com/hjelmn/ompi
 * [new branch]      rdmacm_fix -> hjelmn/rdmacm_fix
Checking out files: 100% (8613/8613), done.
Branch hjelmn-rdmacm_fix set up to track remote branch rdmacm_fix from hjelmn.
Switched to a new branch 'hjelmn-rdmacm_fix'
$ grep repo_rev VERSION
# If repo_rev is empty, then the repository version number will be
repo_rev=

@jsquyres
Member

@jladd-mlnx It looks like your Jenkins is trying to run "make dist", which would have filled in the repo_rev field (thereby making future invocations of autogen.pl think it was running in a tarball). You should probably run "git clean -dfx; git checkout ." to get a completely clean tree and try again.

@jsquyres
Member

@hppritcha @hjelmn I'm tempted to merge this anyway. It fixes the known bug, and we'd like to get a v2.x PR for this in before tonight's MTT run so that we can make the rc tomorrow. Thoughts?

@hppritcha
Member

Let's go ahead and merge, and get this over to 2.x in time for MTT testing tonight.

@jsquyres jsquyres merged commit f18d660 into open-mpi:master Jun 28, 2016
@jsquyres
Member

@hjelmn Can you make a v2.x PR? Thanks.

@mike-dubman
Member

@jsquyres Regarding "git clean -dfx; git checkout .": Jenkins does a series of builds and cleanups during its run. Are there any incompatible build/clean/build sequences that could trigger such situations?

@jsquyres
Member

@miked-mellanox That's not really what I'm saying. The Mellanox Jenkins does a make dist right after running configure. Sometimes, the Mellanox Jenkins dies during make dist because of this error:

make[2]: *** No rule to make target `../config/depcomp', needed by `distdir'.  Stop.

When that happens, the tree is in an inconsistent state (i.e., VERSION is dirty, because it was in the middle of make dist). If you want to try to build that tree manually, you need to clean it / put it back in a consistent state, and then you can try again. If you just try to run autogen.pl again, it's going to fail the way @jladd-mlnx noted because the tree is in an inconsistent state. This is expected/normal.
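
The cleanup mentioned above can be wrapped in a small helper. This is only a sketch, and reset_tree is a hypothetical name. It assumes the directory is a git working tree. Note that git clean -dfx deletes all untracked and ignored files, so it should only be pointed at a disposable build tree.

```shell
# Hypothetical helper: return a git working tree to a pristine state.
# WARNING: this deletes every untracked and ignored file under the tree.
# Usage: reset_tree PATH_TO_CLONE
reset_tree() {
    (
        cd "$1" || exit 1
        # Remove untracked and ignored files and directories.
        git clean -dfx || exit 1
        # Discard modifications to tracked files, such as a modified VERSION.
        git checkout .
    )
}
```

After reset_tree path_of_ompi_clone succeeds, autogen.pl should again see an empty repo_rev in VERSION.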

FWIW, I've seen errors like that in two general kinds of cases:

  1. when the build is occurring on a network filesystem and there are time synchronization issues between the local machine and the filesystem server
  2. when some other agent is operating on the same directory tree and touches files / changes timestamps at the same time (and potentially even deletes files) -- e.g., if somehow, some other process is also running configure in the same tree

@mike-dubman
Member

Ohh, I see. Thanks for the clarification.

Jenkins runs on a local filesystem and the workspace is per job, so there should not be any sharing of workspaces.

No rule to make target `../config/depcomp', needed by `distdir'.  Stop.

depcomp seems to be present in some make dependencies but is unexpectedly removed, probably by autogen.pl?

$ git grep depcomp
.gitignore:depcomp
.hgignore_global:depcomp
autogen.pl:find_and_delete(qw/config.guess config.sub depcomp compile install-sh ltconfig
contrib/code_counter.pl:                   "libtool", "depcomp", "aclocal.m4", "install-sh",
contrib/hg/build-hgignore.pl:                  depcomp
opal/mca/hwloc/hwloc191/hwloc/config/depcomp:# depcomp - compile a program generating dependencies as side-effects
opal/mca/hwloc/hwloc191/hwloc/config/depcomp:Usage: depcomp [--help] [--version] PROGRAM [ARGS]
opal/mca/hwloc/hwloc191/hwloc/config/depcomp:    echo "depcomp $scriptversion"
opal/mca/hwloc/hwloc191/hwloc/config/depcomp:  echo "depcomp: Variables source, object and depmode must be set" 1>&2

@jsquyres
Member

jsquyres commented Jun 29, 2016

Local filesystem: ok, good -- that should obviate any possibility of timestamp / dependency issues.

depcomp (and several others) are generated files. Running autogen.pl generates them. Specifically: in the beginning of autogen.pl, we remove old / stale versions that may be lying around in the tree. Then we run all the autotools, and depcomp (and several others) should be generated. For example:

$ cd path_of_ompi_clone
$ git clean -dfx
# ...lots of output...
$ find . -name depcomp
# Note that no copies of depcomp are found, because we just cleaned the tree

# Now we run autogen
$ ./autogen.pl >& auto.out

# Now look for depcomp again
$ find . -name depcomp
./config/depcomp
./opal/mca/event/libevent2022/libevent/depcomp
./opal/mca/pmix/pmix2x/pmix/config/depcomp
./ompi/mca/io/romio314/romio/confdb/depcomp

If depcomp is not generated during autogen.pl, then something likely went wrong when you ran autogen.pl (e.g., an error in running the GNU Autotools?).

Indeed, in the Mellanox Jenkins output, I can see that autogen.pl claimed to have successfully generated depcomp:

12:40:30 ompi/Makefile.am: installing 'config/depcomp'

So it's an oddity that autogen.pl is generating depcomp, but then it's missing slightly later in your Jenkins job. ...although I do see this line in the Jenkins output:

13:07:59 cp: skipping file `/hpc/local/share/automake-1.15/depcomp', as it was replaced while being copied

This might suggest that something else is going on on the server (e.g., /hpc/local/share/automake-1.15/ itself is being edited)...?
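
If it helps with debugging, a job could check for the generated files between autogen.pl and make dist. The sketch below is an assumption about how such a guard might look, not something the Mellanox Jenkins scripts are known to do; missing_generated is a hypothetical helper.

```shell
# Hypothetical helper: report any expected generated file that is missing.
# Usage: missing_generated TOP_DIR FILE...
# Prints one "missing: FILE" line per absent file and returns non-zero
# if at least one file is absent.
missing_generated() {
    top=$1
    shift
    status=0
    for f in "$@"; do
        if [ ! -f "$top/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

For example, missing_generated . config/depcomp run right before make dist would show whether depcomp disappeared after autogen.pl finished or was never installed in the first place.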

Successfully merging this pull request may close these issues.

RDMACM failures in Mellanox
