
osc/rdma v3.1.3rc2: PR #5924 seems to introduce segfaults on simple tests #5969

Closed
bsergentm opened this issue Oct 24, 2018 · 14 comments

@bsergentm
Contributor

bsergentm commented Oct 24, 2018

Background information

What version of Open MPI are you using?

v3.1.3rc2, commit 6187b7a
Open MPI repo revision (from ompi_info --all): v3.1.3rc1-2-gfa3d929

Describe how Open MPI was installed

./configure '--prefix=/homes/sergentm/ompi-installdir' '--with-devel-headers' '--enable-binaries' '--enable-mpi-cxx' '--enable-mpirun-prefix-by-default' '--with-pmi=/homes/sergentm/export/slurm/usr' '--with-lustre=/opt/lustre_210_for_deps/usr' '--enable-wrapper-rpath=no' '--enable-wrapper-runpath=no' '--with-knem=/homes/sergentm/export/knem_dir/opt/knem-1.1.3.90mlnx1' '--with-cma' '--with-portals4=no' '--with-verbs=/usr' '--with-verbs-libdir=/usr/lib64' '--disable-openib-dynamic-sl' '--with-mxm=/opt/mellanox/mxm' '--with-ucx=no' 'CFLAGS=-g -pipe -m64' 'CXXFLAGS=-g -pipe -m64' 'FCFLAGS=-g -pipe -m64'

Please describe the system on which you are running

  • Operating system/version:
    Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Computer hardware:
    Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  • Network type:
    InfiniBand 100 Gb/sec (4X EDR)

More details are in the file "env.txt" inside the archive attached here: v3.1.x_osc-rdma_segment-register.tar.gz


Details of the problem

Commit 4fbe078 ("RDMA OSC: initialize segment memory before registering the segment") from PR #5924 seems to introduce a regression in MPI-RMA applications that use the passive target mode (lock/unlock).

I attached a simple reproducer to this issue. I launched it on the platform described above, both with srun and mpirun, and got the same failure.
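
For context, here is a minimal sketch of the kind of passive-target ring test that triggers the failure. It is an illustrative sketch only, not the exact reproducer attached above; the ring pattern and the message value simply mirror the output shown below.

/* Illustrative sketch only -- not the exact attached reproducer.
 * Each rank locks its right neighbor's window, puts one integer,
 * and unlocks it (MPI-RMA passive target synchronization). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, *buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Window memory allocated by the library; this path exercises the
     * osc/rdma segment registration touched by the commit in question. */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1;

    int right = (rank + 1) % size;
    int msg   = 10 + rank;

    printf("Process %d puts message %d to neighbor %d (%d processes in ring)\n",
           rank, msg, right, size);

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, right, 0, win);
    MPI_Put(&msg, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_unlock(right, win);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}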

shell$ mpirun -np 4 --map-by ppr:2:node -H node1,node2 ./simple_test
mlx5: node1: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000
00000000 00008813 08028cc4 0001f0d2
[[63443,0],0][btl_openib_component.c:3644:handle_wc] from node1 to: node1 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id e4c798 opcode 1  vendor error 136 qp_idx 3
Process 0 puts message 10 to neighbor 1 (4 processes in ring)
[node1:159035:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x00000000000687bc mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3104/src/mxm/util/debug/debug.c:641
 3 0x0000000000068d0c mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3104/src/mxm/util/debug/debug.c:616
 4 0x0000000000035270 killpg()  ??:0
===================

The issue seems to happen when two or more processes are spawned per node and at least two nodes are used.
When looking at commit 4fbe078, I found that the synchronization between peers (the shared_comm->c_coll->coll_barrier call) has been moved before the ompi_osc_rdma_register call. When I moved it back to where it was before the commit, the test worked again.
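
To make the ordering concrete, here is a generic, self-contained sketch of the pattern involved. It is not the actual osc/rdma code; node_comm, handle_slot, and the 0x4242 value are made up for illustration. The point is that peers treat the barrier as the signal that the registering process has finished publishing its data, so the barrier must come after the write, not before it.

/* Generic illustration of the ordering requirement -- NOT the osc/rdma
 * source. One rank per node publishes a value (standing in for the
 * registration handle); the others must not read it until a barrier
 * that follows the write. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;                    /* hypothetical stand-in for shared_comm */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Node-local shared segment; rank 0 owns the slot that peers will read. */
    MPI_Win win;
    long *base;
    MPI_Aint alloc = (node_rank == 0) ? sizeof(long) : 0;
    MPI_Win_allocate_shared(alloc, sizeof(long), MPI_INFO_NULL,
                            node_comm, &base, &win);

    MPI_Aint qsize;
    int qdisp;
    long *handle_slot;                     /* hypothetical "registration handle" */
    MPI_Win_shared_query(win, 0, &qsize, &qdisp, &handle_slot);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    if (node_rank == 0) {
        *handle_slot = 0x4242;             /* publish the handle */
        MPI_Win_sync(win);                 /* make the write visible */
    }

    /* The barrier must come AFTER the write above. If it is moved before
     * the write (which is effectively what the commit did for the
     * registration), peers may read an uninitialized handle. */
    MPI_Barrier(node_comm);
    MPI_Win_sync(win);

    printf("node rank %d sees handle 0x%lx\n", node_rank,
           (unsigned long)*handle_slot);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}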

I do not know whether the fix I propose here is consistent with what this commit originally intended, so I prefer to open an issue rather than submit a partial-revert pull request.

PS: removing the MPI_Put() call from the reproducer I provided gives a different error:

[node1:162962] too many retries sending message to 0x0022:0x00016f67, giving up
--------------------------------------------------------------------------
The OpenFabrics driver in Open MPI tried to raise a fatal error, but
failed.  Hopefully there was an error message before this one that
gave some more detailed information.

  Local host: node1
  Source file: btl_openib_endpoint.c
  Source line: 1037

Your job is now going to abort, sorry.
--------------------------------------------------------------------------

This issue may have other visible effects. While looking through recently opened issues, I noticed it might also be related to #5946.

@bsergentm bsergentm changed the title osc/rdma v3.1.3rc2: commit 4fbe078 seems to introduce segfaults on simple tests osc/rdma v3.1.3rc2: PR https://github.com/open-mpi/ompi/pull/5923 seems to introduce segfaults on simple tests Oct 24, 2018
@bsergentm bsergentm changed the title osc/rdma v3.1.3rc2: PR https://github.com/open-mpi/ompi/pull/5923 seems to introduce segfaults on simple tests osc/rdma v3.1.3rc2: PR #5923 seems to introduce segfaults on simple tests Oct 24, 2018
@jsquyres jsquyres changed the title osc/rdma v3.1.3rc2: PR #5923 seems to introduce segfaults on simple tests osc/rdma v3.1.3rc2: PR #5924 seems to introduce segfaults on simple tests Oct 24, 2018
@jsquyres
Member

Update: #5923 is the v4.0.x version of the PR that seems to have caused the issue. #5924 is the v3.1.x version of the same PR, and #5925 is the v3.0.x version of the same PR. I'm therefore assuming that this same issue arises on all 3 branches.

The PR was merged on the v3.0.x and v3.1.x branches, but has not yet (as of 24 Oct 2018) been merged on the v4.0.x branch.

@hjelmn
Member

hjelmn commented Oct 24, 2018

Hmmm... This is interesting. Maybe we need a second barrier? Won't really hurt much.

@hjelmn
Member

hjelmn commented Oct 24, 2018

I can throw together a PR and see if it helps. If not, the problem might be more complicated.

@devreal
Contributor

devreal commented Oct 24, 2018

I'm looking into it at the moment (as that PR came from me). Sorry for the trouble.

@hjelmn
Member

hjelmn commented Oct 24, 2018

No problem. I didn't think it would cause any problems when I looked at it.

@devreal
Contributor

devreal commented Oct 24, 2018

I think you're right: we would need another barrier at line 660 (where it was previously) to make sure that state_region->btl_handle_data is properly set.

@hjelmn
Member

hjelmn commented Oct 24, 2018

OK, if you want to open a PR for that, go ahead; otherwise I will post one this evening.

@devreal
Contributor

devreal commented Oct 24, 2018

I'm waiting for our system to come back up so I can verify with the attached reproducer, and I will post a PR after that.

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 26, 2018
…gment"

This reverts commit 4f435e8.

Issue open-mpi#5969 was caused by this commit and the 4.0.x release
managers decided not to take this patch, so having it in
3.0.x and 3.1.x is a bit awkward.
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 26, 2018
…gment"

This reverts commit 4fbe078.

Issue open-mpi#5969 was caused by this commit and the 4.0.x release
managers decided not to take this patch, so having it in
3.0.x and 3.1.x is a bit awkward.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 26, 2018
…gment"

This reverts commit 4f435e8.

Issue open-mpi#5969 was caused by this commit and the 4.0.x release
managers decided not to take this patch, so having it in
3.0.x and 3.1.x is a bit awkward.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
@bwbarrett
Member

We reverted the offending patch from the 3.0.x and 3.1.x release branches (the patch never shipped in a release). Removing the tags for those branches.

@hjelmn
Member

hjelmn commented Oct 26, 2018

@devreal This gives us a little more time to see if there is a better solution for ensuring locality. I still want the issue fixed, but it needs just a little more time.

@devreal
Contributor

devreal commented Oct 26, 2018

I agree; maybe this was too much of a shot from the hip. I will continue working on the issue, and hopefully we can bring it into the next release.

@gpaulsen
Member

@bwbarrett Since you reverted those commits, can we close this issue?

@bwbarrett
Member

I only reverted for 3.0.x and 3.1.x. My understanding was that there was still a 4.0.x problem, but I didn't really look at it in any depth (that's why I left the 4.x label). So I think closing is up to the 4.0.x RMs.

@gpaulsen
Member

Should be fixed in v5.0.x/master. Closing. Please reopen if you still see this issue today.
