Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Feb 17, 2016

Found several regressions in XRC support not identified by Mellanox Jenkins. This PR fixes all the regressions identified with libibverbs 1.1.8.

:bot:assign: @artpol84
:bot🏷️bug
:bot:milestone:v2.0.0

Commit open-mpi/ompi@400af6c
introduced a regression in XRC support. The commit reversed the
ordering of shared receive queue (SRQ) and completion queue (CQ)
completion. CQ creation must always preceed SRQ creation when using
XRC as the CQs are needed to create the SRQs. This commit fixes the
ordering so that CQs are always created before SRQs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@123a39a)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
mca_btl_openib_put incorrectly checks the qp inline max before
allowing an inline put. This check will always fail for an endpoint
that has not been connected. The commit changes the check to use the
btl_put_local_registration_threshold instead.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@201c280)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes two bugs in XRC support

 - When dynamic add_procs support was added to master the remote
   process name was added to the non-XRC request structure. The same
   value was not added to the XRC xconnect structure. This error was
   not caught because the send/recv code was incorrectly using the
   wrong structure member. This commmit adds the member and ensure the
   xconnect code uses the correct structure.

 - XRC loopback QP support has been fixed. It was 1) not setting the
   correct fields on the endpoint structure, 2) calling
   udcm_xrc_recv_qp_connect, and 3) was not initializing the endpoint
   data.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@bf83603)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@ompiteam-bot ompiteam-bot added this to the v2.0.0 milestone Feb 17, 2016
@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1347/ for details.

@hppritcha
Copy link
Member

What's a good test to run that verifies that this fixes XRC issues?

@hjelmn
Copy link
Member Author

hjelmn commented Feb 17, 2016

Running ibm/onesided with XRC enabled should do it. I enabled XRC on mustang by adding -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512.

Mellanox really should run some of their Jenkins tests with XRC.

@hjelmn
Copy link
Member Author

hjelmn commented Feb 17, 2016

Found one more little bug. PR is now on master open-mpi/ompi#1373. Will add to the PR once it is merged.

@artpol84
Copy link
Contributor

I'll review till the end of the week and will get back to you.

This commit ensures ib_addr->remote_xrc_rcv_qp_num value is set when
creating the loopback queue pair. This is needed when communicating
with any other local peer.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@4f4ea96)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
@hjelmn
Copy link
Member Author

hjelmn commented Feb 18, 2016

@artpol84 I opened a PR to add XRC testing to the Mellanox Jenkins. The testing should check everything but the loopback queue pair usage. That needs to be tested with a one-sided test and osc/rdma.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1348/ for details.

@hjelmn
Copy link
Member Author

hjelmn commented Feb 18, 2016

Mellanox accepted my patch to add XRC to their Jenkins.

:bot:retest:

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1350/ for details.

@hjelmn
Copy link
Member Author

hjelmn commented Feb 18, 2016

Looks like the Mellanox Jenkins ran successfully with the XRC check. That should prevent future regressions.

hjelmn and others added 3 commits February 18, 2016 22:05
This commit fixes a bug that occurs when attempting a get or put
operation on an endpoint that is not already connected. In this case
the remote_srqn may be set to an invalid value as the rem_srqs array
on the endpoint is not populated. This commit moves the usage of the
rem_srqs array to the internal put/get functions where it is
guaranteed this array is populated.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@4dc73d7)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This bug fixes two issue with the ib_addr lock:

 - The ib_addr lock must always be obtained regardless of
   opal_using_threads() as the CPC is run in a seperate thread.

 - The ib_addr lock is held in mca_btl_openib_endpoint_connected when
   calling back into the CPC start_connect on any pending
   connections. This will attempt to obtain the ib_addr lock
   again. Since this is not a performance-critical part of the code
   the lock has been changed to be recursive.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@371df45)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit fixes a bug that can occur when communicating via XRC to
peers on the same node. UDCM was not saving the SRQ numbers on the
loopback endpoint (which shares its ib_addr info with all local peers)
so any messages to local peers use an invalid SRQ number.

Fixes open-mpi/ompi#1383

Signed-off-by: Nathan Hjelm <hjelmn@me.com>

(cherry picked from commit open-mpi/ompi@2031bb6)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
@hjelmn
Copy link
Member Author

hjelmn commented Feb 19, 2016

Think I got all the XRC bugs. Running completely clean for me now. If it passes the Mellanox jenkins should be good to go after @artpol84 review.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1354/ for details.

@artpol84
Copy link
Contributor

All seems fine to me.
Note that I'm only briefly familiar with XRC implementation, so I may be missing something.

@artpol84
Copy link
Contributor

👍

hppritcha added a commit that referenced this pull request Feb 23, 2016
@hppritcha hppritcha merged commit 81f5f89 into open-mpi:v2.x Feb 23, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants