btl/openib: fix segmentation fault #1824
This commit fixes a segmentation fault that occurs if a device can be initialized but not used. In this case the devices_count is not equal to the number of usable devices in the devices pointer array.
Thanks to @artpol84 for tracking this down.
Fixes open-mpi#1823
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
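For readers unfamiliar with this class of bug, here is a minimal sketch of the failure mode the commit message describes. It is not the openib code and not the actual patch: `example_device_t`, `count_usable_ports`, and `count_usable_devices` are hypothetical names, and the sketch assumes that a device which initialized but cannot be used leaves a NULL entry in the pointer array. Under that assumption, a loop bounded by the total slot count must skip NULL entries rather than dereference them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device record; not the real openib structure. */
typedef struct {
    int ports; /* number of ports on this device */
} example_device_t;

/*
 * Sum the ports over every usable device. `slots` is the total number of
 * entries in `devices`, which may be larger than the number of usable
 * devices. Assumption: unusable devices are represented by NULL entries.
 */
static int count_usable_ports(example_device_t *const *devices, size_t slots)
{
    int total = 0;

    assert(devices != NULL || slots == 0);
    for (size_t i = 0; i < slots; ++i) {
        if (devices[i] == NULL) {
            /* Skip unusable slots instead of dereferencing them. */
            continue;
        }
        total += devices[i]->ports;
    }
    return total;
}

/* Count the entries that are actually usable (non-NULL). */
static size_t count_usable_devices(example_device_t *const *devices, size_t slots)
{
    size_t usable = 0;

    for (size_t i = 0; i < slots; ++i) {
        if (devices[i] != NULL) {
            ++usable;
        }
    }
    return usable;
}
```

With three slots of which the middle one is NULL, `count_usable_devices` reports two usable devices while the slot count stays at three. Treating the two numbers as interchangeable is the kind of mismatch that can lead to a NULL dereference.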
|
v2.0.0 or v2.0.1? |
|
2.0.0. No reason to hold this one back. |
|
This should put an end to the mlx Jenkins failures. This is a long-standing bug, but it seems to have been triggered by a hardware change on the mlx Jenkins systems. |
|
@hppritcha I am good with 2.0.0 for this. You? |
|
I've seen this false failure on the Mellanox jenkins before -- not sure why it happens: a bot:retest seems to fix it... |
|
Seeing this when trying to autogen the source on Jenkins server. |
|
autogen.pl checks the VERSION file to see if repo_rev is empty.
Is VERSION's repo_rev somehow not empty for you? |
|
:bot:retest: |
|
:bot:retest |
|
@jsquyres Oddly enough |
|
@jladd-mlnx Yeah, that's odd. Huh. Any idea where that's coming from? It's not set that way in the repo:
$ hub checkout https://github.com/open-mpi/ompi/pull/1824
Updating hjelmn
remote: Counting objects: 21388, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 21388 (delta 7496), reused 7495 (delta 7495), pack-reused 13885
Receiving objects: 100% (21388/21388), 7.84 MiB | 2.74 MiB/s, done.
Resolving deltas: 100% (17354/17354), completed with 2939 local objects.
From git://github.com/hjelmn/ompi
* [new branch] rdmacm_fix -> hjelmn/rdmacm_fix
Checking out files: 100% (8613/8613), done.
Branch hjelmn-rdmacm_fix set up to track remote branch rdmacm_fix from hjelmn.
Switched to a new branch 'hjelmn-rdmacm_fix'
$ grep repo_rev VERSION
# If repo_rev is empty, then the repository version number will be
repo_rev= |
|
@jladd-mlnx It looks like your jenkins is trying to run "make dist", which would have filled in the repo_rev field (thereby making future invocations of |
|
@hppritcha @hjelmn I'm tempted to merge this anyway. It fixes the known bug, and we'd like to get a v2.x PR for this in before tonight's MTT so that we can make the rc tomorrow. Thoughts? |
|
Let's go ahead and merge and get this over to v2.x in time for MTT testing tonight. |
|
@hjelmn Can you make a v2.x PR? Thanks. |
|
|
|
@miked-mellanox That's not really what I'm saying. The Mellanox Jenkins does a When that happens, the tree is in an inconsistent state (i.e., FWIW, I've seen errors like that in two general kinds of cases:
|
|
Ohh, I see. Thanks for the clarification. Jenkins runs on a local filesystem and the workspace is per job, so there should not be any sharing of workspaces.
No rule to make target `../config/depcomp', needed by `distdir'. Stop.
depcomp seems to be present in some make dependencies but was unexpectedly removed, probably by autogen.pl?
$ git grep depcomp
.gitignore:depcomp
.hgignore_global:depcomp
autogen.pl:find_and_delete(qw/config.guess config.sub depcomp compile install-sh ltconfig
contrib/code_counter.pl: "libtool", "depcomp", "aclocal.m4", "install-sh",
contrib/hg/build-hgignore.pl: depcomp
opal/mca/hwloc/hwloc191/hwloc/config/depcomp:# depcomp - compile a program generating dependencies as side-effects
opal/mca/hwloc/hwloc191/hwloc/config/depcomp:Usage: depcomp [--help] [--version] PROGRAM [ARGS]
opal/mca/hwloc/hwloc191/hwloc/config/depcomp: echo "depcomp $scriptversion"
opal/mca/hwloc/hwloc191/hwloc/config/depcomp: echo "depcomp: Variables source, object and depmode must be set" 1>&2 |
|
Local filesystem: ok, good -- that should obviate any possibility of timestamp / dependency issues.
$ cd path_of_ompi_clone
$ git clean -dfx
# ...lots of output...
$ find . -name depcomp
# Note that no copies of depcomp are found, because we just cleaned the tree
# Now we run autogen
$ ./autogen.pl >& auto.out
# Now look for depcomp again
$ find . -name depcomp
./config/depcomp
./opal/mca/event/libevent2022/libevent/depcomp
./opal/mca/pmix/pmix2x/pmix/config/depcomp
./ompi/mca/io/romio314/romio/confdb/depcomp
If depcomp is not generated during Indeed, in the Mellanox Jenkins output, I can see that autogen.pl claimed to have successfully generated So it's an oddity that This might suggest that something else is going on on the server (e.g., |