This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@hjelmn
Member

@hjelmn hjelmn commented Jun 28, 2016

This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.

Thanks to @artpol84 for tracking this down.

Fixes open-mpi/ompi#1823

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@8128c8e)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
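
For readers unfamiliar with this class of bug, here is a minimal sketch of how a count that disagrees with a pointer array can lead to a segmentation fault, and one defensive way to walk such an array. It is not the code from this commit. The type `demo_device_t` and the functions `demo_usable_count` and `demo_total_ports` are hypothetical names, and the sketch assumes that slots for devices that initialized but turned out to be unusable are left as NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device record; NOT the real btl/openib type. */
typedef struct {
    int usable_ports;
} demo_device_t;

/*
 * Count the usable entries in a pointer array.
 * Assumption: a slot whose device initialized but cannot be used is NULL,
 * so `slots` (the array length) may exceed the number of usable devices.
 */
static size_t demo_usable_count(demo_device_t **devices, size_t slots)
{
    size_t usable = 0;

    assert(devices != NULL || slots == 0);
    for (size_t i = 0; i < slots; ++i) {
        if (NULL != devices[i]) {
            ++usable;
        }
    }
    return usable;
}

/*
 * Sum the ports of all usable devices. Dereferencing every slot up to
 * `slots` without the NULL check is the kind of access that crashes when
 * the count and the array contents disagree.
 */
static int demo_total_ports(demo_device_t **devices, size_t slots)
{
    int total = 0;

    assert(devices != NULL || slots == 0);
    for (size_t i = 0; i < slots; ++i) {
        if (NULL == devices[i]) {
            continue; /* unusable slot: skip it instead of dereferencing */
        }
        total += devices[i]->usable_ports;
    }
    return total;
}
```

With an array such as `{ &a, NULL, &b }`, the slot count is 3 while only 2 devices are usable, so any loop bounded by the count has to tolerate the empty slot.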

@hjelmn
Member Author

hjelmn commented Jun 28, 2016

:bot:label:bug
:bot:milestone:v2.0.0
:bot:assign: @artpol84

Fixes an issue triggered by mlx jenkins.

@hppritcha
Member

@jsquyres I'm good with merging this once CI is complete.

@jsquyres
Member

@hjelmn @artpol84 Sad panda -- the openib fix didn't seem to fix the segv seen in the Mellanox Jenkins runs. Might need further diagnosis...?

@hjelmn
Member Author

hjelmn commented Jun 29, 2016

@jsquyres Looks like that may be a different SEGV.

@jsquyres
Member

Arrgh. It's a timeout.

@hjelmn
Member Author

hjelmn commented Jun 29, 2016

Ok, then that is the rdmacm finalize hang I think. My change doesn't fix that one.

EDIT: Nathan meant "rdmacm".

@jsquyres
Member

Do we have any idea what that hang in finalize is?

@rhc54

rhc54 commented Jun 29, 2016

Pretty sure @hjelmn meant "rdmacm", not "rdma". We've been seeing it for a while.

@hjelmn
Member Author

hjelmn commented Jun 29, 2016

No idea unfortunately. @artpol84 If you have time can you see if you can find the rdmacm finalize hang? I will take a look on Friday if you can't get to it.

@hjelmn
Member Author

hjelmn commented Jun 29, 2016

Yup. rdmacm.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1818/ for details.

@hppritcha
Member

An explanation of the mlnx jenkins failure is required before merging this PR.

@hjelmn
Member Author

hjelmn commented Jul 5, 2016

The only thing I can determine is that rdmacm (or the clear-to-send connection protocol in btl/openib) is not thread safe. This set of commits fixed the non-threaded case and incorrect locking in the threaded case. For 2.0.1 we are waiting on someone to fix the threaded case.

@jsquyres
Member

jsquyres commented Jul 5, 2016

👍

@jsquyres jsquyres merged commit ac508f5 into open-mpi:v2.x Jul 5, 2016

Development

Successfully merging this pull request may close these issues.

RDMACM failures in Mellanox

7 participants