osc/rdma segmentation fault when using IB atomics (c_post_start) #1329

Closed
sjeaugey opened this issue Jan 26, 2016 · 5 comments

@sjeaugey
Member

The MTT test ibm/onesided/c_post_start still fails with a segmentation fault when using the new IB atomics.

Setting btl_openib_flags to 12599 makes the bug disappear.
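For anyone wanting to try the workaround, here is a minimal sketch of one way to apply that setting. It relies on Open MPI's convention of reading MCA parameters from `OMPI_MCA_<name>` environment variables; the binary path `./c_post_start` and the `TEST_BIN` variable are assumptions for illustration, since the report does not say where the test is built.

```shell
# Workaround value taken from the report above. Exporting the variable is
# equivalent to passing `--mca btl_openib_flags 12599` on the mpirun command line.
export OMPI_MCA_btl_openib_flags=12599

# Hypothetical location of the test binary (not given in the report).
TEST_BIN=./c_post_start

# The crash was reported with 2 ranks, so run with 2 ranks when possible.
if command -v mpirun >/dev/null 2>&1 && [ -x "$TEST_BIN" ]; then
    mpirun -np 2 "$TEST_BIN"
else
    echo "skipping run: mpirun or $TEST_BIN not found"
fi
```

This only hides the crash; it does not address the underlying bug in osc/rdma.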

The core I get is corrupted so I don't have a stack trace yet - will post it as soon as I get it.

@sjeaugey
Member Author

Here is the stack trace (sorry, no line numbers). The crash reproduces with only 2 ranks, and it occurs on rank 1.

#0  0x00007f439a74f5d0 in ?? ()
#1  <signal handler called>
#2  0x00007f4399d36386 in ompi_osc_rdma_complete_atomic () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/DIFa/install/lib/openmpi/mca_osc_rdma.so
#3  0x00007f43a2f86fe6 in PMPI_Win_complete () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/ompi-master--gcc_master--dev-3397-g70787d1/install/lib/libmpi.so.0
#4  0x0000000000401089 in main ()

@sjeaugey changed the title from "Segmentation fault when using IB atomics (c_post_start)" to "osc/rdma segmentation fault when using IB atomics (c_post_start)" on Jan 26, 2016
@jsquyres
Member

EDIT: Added triple-single-ticks to the verbatim section to make it easier to read.

@sjeaugey
Member Author

Thanks Jeff.

Nathan, the crash does not always happen on rank 1; sometimes it happens on rank 0 as well.

@hjelmn
Member

hjelmn commented Jan 26, 2016

My MTT was out of date on our clusters. Updated and now I see this issue. Will fix now.

hjelmn added a commit to hjelmn/ompi that referenced this issue Jan 26, 2016
The typo caused SEGVs on systems with only fetching atomic
support.

Fixes open-mpi#1329

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Member

hjelmn commented Jan 26, 2016

Simple enough. There was a typo in the osc/rdma MPI_Win_start implementation. Fixed in the referencing PR.

hjelmn added a commit to hjelmn/ompi-release that referenced this issue Mar 8, 2016
The typo caused SEGVs on systems with only fetching atomic
support.

Fixes open-mpi/ompi#1329

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@a19c265)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016
The typo caused SEGVs on systems with only fetching atomic
support.

Fixes open-mpi#1329

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>