This repository was archived by the owner on Sep 30, 2022. It is now read-only.

@hjelmn
Member

@hjelmn hjelmn commented Jun 30, 2016

This commit attempts to fix a hang in rdmacm finalize. It fixes
a path where no rdmacm client is found for an endpoint.

Fixes open-mpi/ompi#1829

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@960fcd2)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

@hjelmn
Member Author

hjelmn commented Jun 30, 2016

:bot:label:bug
:bot:milestone:v2.0.0
:bot:assign: @bharatpotnuri

@ompiteam-bot

OMPIBot error: User bharatpotnuri is not valid for issue 1251.

@ompiteam-bot ompiteam-bot added this to the v2.0.0 milestone Jun 30, 2016
@hjelmn
Member Author

hjelmn commented Jun 30, 2016

@bharatpotnuri Please verify the fix and reply with :+1: to mark this as reviewed.

@hjelmn
Member Author

hjelmn commented Jun 30, 2016

bah, the bot pulled the +1 out of that

:bot:nolabel:reviewed

@hjelmn
Member Author

hjelmn commented Jun 30, 2016

:bot:nolabel:pushed-back

@hjelmn hjelmn changed the title from "btl/openib: fix rdma hang" to "btl/openib: fix rdmacm hang" Jun 30, 2016
@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1822/ for details.

@hjelmn
Member Author

hjelmn commented Jun 30, 2016

Jenkins found a bug that looks like it has been in rdmacm for some time. It was hidden because we were disabling the openib btl whenever MPI_THREAD_MULTIPLE was in use. Should pass Jenkins now.

@bharatpotnuri

commit ed2bba2 tests fine for iwarp and fixes the stall at MPI_Finalize.
Thanks @hjelmn !

@hjelmn
Member Author

hjelmn commented Jun 30, 2016

👍

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1823/ for details.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1824/ for details.

@ibm-ompi

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

@ibm-ompi

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

@jjhursey
Member

I don't know why the IBM CI tests are not showing up in the list, but here are the failure logs:
https://gist.github.com/89fdbe8184b26859afcbe1d0528dfcac
https://gist.github.com/ibm-ompi/1fefe2a65545eea068ddc825ebd5561f

@hjelmn
Member Author

hjelmn commented Jun 30, 2016

@jjhursey There was a typo that I corrected before the IBM results came back. Probably why they are not showing up.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1825/ for details.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1826/ for details.

@hjelmn
Member Author

hjelmn commented Jul 1, 2016

Mellanox failure is fixed by #1249. Both together should make Jenkins happy.

@hppritcha hppritcha changed the milestone from v2.0.0 to v2.0.1 Jul 5, 2016
@jsquyres
Member

jsquyres commented Jul 5, 2016

Per discussion on the webex today, @hjelmn will separate this into two PRs:

  1. the single-threaded regression fix will stay here and go into v2.0.0
  2. the multi-threaded rdmacm fix will move into a separate PR and go into v2.0.1

@jsquyres jsquyres changed the milestone from v2.0.1 to v2.0.0 Jul 5, 2016
@hjelmn
Member Author

hjelmn commented Jul 5, 2016

@jsquyres Removed the threading bug fix. Now only has the regression fix.

@jsquyres
Member

jsquyres commented Jul 5, 2016

@hjelmn Thanks
@hppritcha Once CI finishes, good to go (although until Mellanox removes the multi-threaded RDMACM test, it's going to fail)

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1831/ for details.

@jsquyres jsquyres merged commit ddb21f7 into open-mpi:v2.x Jul 11, 2016
Development

Successfully merging this pull request may close these issues.

rdmacm sometimes hangs in finalize in Mellanox Jenkins

8 participants