btl/openib: fix segmentation fault #1248
This commit fixes a segmentation fault that occurs if a device can be initialized but not used. In this case the devices_count is not equal to the number of usable devices in the devices pointer array.

Thanks to @artpol84 for tracking this down.

Fixes open-mpi/ompi#1823

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from open-mpi/ompi@8128c8e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
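For readers unfamiliar with this class of bug, the mismatch described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual patch: `demo_device_t`, `demo_collect_usable`, and `demo_sum_ids` are hypothetical names, and the real btl/openib structures and the real fix may look quite different. The idea is that a counter tracking initialized devices can exceed the number of pointers actually stored in the array, so any loop that trusts the counter may dereference an empty slot.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device entry; NOT the real btl/openib type. */
typedef struct {
    int usable; /* nonzero if the device can actually be used */
    int id;     /* arbitrary identifier used by the demo */
} demo_device_t;

/*
 * Store pointers to the usable devices in `out` (which must have room for
 * `n_all` entries) and return how many were stored. Callers should use this
 * return value, not the number of initialized devices, as the array length.
 */
size_t demo_collect_usable(demo_device_t *all, size_t n_all,
                           demo_device_t **out)
{
    size_t stored = 0;
    for (size_t i = 0; i < n_all; ++i) {
        if (all[i].usable) {
            out[stored++] = &all[i];
        }
    }
    return stored;
}

/*
 * Walk `count` slots of the pointer array and sum the ids. The NULL check
 * keeps the loop safe even when `count` is larger than the number of
 * pointers actually stored, which is the situation that would otherwise
 * lead to a NULL dereference.
 */
int demo_sum_ids(demo_device_t *const *devices, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; ++i) {
        if (NULL == devices[i]) {
            continue; /* initialized-but-unusable device left this slot empty */
        }
        sum += devices[i]->id;
    }
    return sum;
}
```

In this sketch, three devices may initialize while only two are usable; iterating with the stored count (2) or guarding against empty slots both avoid the crash, whereas iterating three slots without the guard would not.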
|
:bot🏷️bug Fixes an issue triggered by mlx jenkins. |
|
@jsquyres I'm good with merging this in once CI is complete |
|
@jsquyres Looks like that may be a different SEGV. |
|
Arrgh. It's a timeout. |
|
Ok, then that is the rdmacm finalize hang I think. My change doesn't fix that one. EDIT: Nathan meant "rdmacm". |
|
Do we have any idea what that hang in finalize is? |
|
Pretty sure @hjelmn meant "rdmacm", not "rdma" - we've been seeing it for a while. |
|
No idea unfortunately. @artpol84 If you have time can you see if you can find the rdmacm finalize hang? I will take a look on Friday if you can't get to it. |
|
Yup. rdmacm. |
|
Test FAILed. |
|
An explanation of the mlnx jenkins failure is required before merging this PR. |
|
The only thing I can determine is that rdmacm (or the clear-to-send connection protocol in btl/openib) is not thread safe. This set of commits fixed the non-threaded case and incorrect locking in the threaded case. For 2.0.1 we are waiting on someone to fix the threaded case. |
|
👍 |