Skip to content

Conversation

@bharatpotnuri
Copy link
Contributor

@bharatpotnuri bharatpotnuri commented Jul 12, 2016

The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.

Fixes #1848

Signed-off-by: Potnuri Bharat Teja bharat@chelsio.com

@jsquyres
Copy link
Member

@bharatpotnuri Github pro tip: If you say "Fixes #1848" in the commit message, it'll auto-close #1848 when this PR gets merged.

@ibm-ompi
Copy link

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/8c902a00aebe52469163a4742cc2e52a

@jjhursey
Copy link
Member

bot:ibm:retest

@ibm-ompi
Copy link

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

Gist: https://gist.github.com/b21fa0fa4ec5be8f9cc8530cfa2f7488

@jjhursey
Copy link
Member

This is a valid error from IBM (sorry the first test was hiding the error message):

Making all in tools/wrappers
make[2]: Entering directory `/gpfs/gpfs_stage1/jhursey/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/opal/tools/wrappers'
  CC       opal_wrapper.o
  GENERATE opal_wrapper.1
  CCLD     opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `ompi_mpi_thread_multiple'

@bharatpotnuri
Copy link
Contributor Author

@jjhursey Looks like the failure is due to --disable-dlopen configure option. Any specific reason for it?

@jsquyres
Copy link
Member

--disable-dlopen is a valid Open MPI configure option; Open MPI needs to be able to work both with and without it.

@bharatpotnuri
Copy link
Contributor Author

bharatpotnuri commented Jul 13, 2016

Yes, I agree, working on it. @jsquyres I am not able to fully understand how these wrappers/libs get complied and linked in our current problem. I didn't get much regarding wrappers from the ompi resources available. Could you/somebody please shed some light on this.
Thanks.

@jjhursey
Copy link
Member

For the XL compiler CI test we have been using --disable-dlopen to get a bit better coverage of options, and, historically, that's just how we have built it for developer builds. Recently, it is necessary to avoid issue #1854. Once that issue is fixed we might have the CI test do two builds - one with and one without that option.

@jsquyres
Copy link
Member

--disable-dlopen does 2 things:

  1. Turns off all the code in Open MPI that dynamically finds/opens plugins at run time.
  2. Slurps up all plugins into their respective libraries (e.g., the openib BTL is slurped up into libopen-pal).

Hence, with --disable-dlopen, the openib BTL is not dlopen'ed at run time, but rather is intrinsically part of libopen-pal. It therefore cannot use any abstractions from the MPI layer because that's an abstraction break. I.e., the MPI layer depends on the OPAL layer -- not the other way around. You need to use the OPAL-layer threading symbol check.

@hjelmn
Copy link
Member

hjelmn commented Jul 13, 2016

@bharatpotnuri Was the final determination that rdmacm is not thread-safe? Also, you can not reference ompi_mpi_thread_multiple from opal since it is in libmpi.

@bharatpotnuri
Copy link
Contributor Author

bharatpotnuri commented Jul 14, 2016

@hjelmn We are seeing ompi failures for rdmacm cpc with MPI_THREAD_MULTIPLE (I am looking for what exactly the failure is). As it need vigorous testing too, we decided to disqualify rdmacm for multithread apps for now.
I could not access ompi_mpi_thread_multiple from opal for dynamic libraries are disabled. The actual cause should be what you said "can not reference ompi_mpi_thread_multiple from opal since it is in libmpi"

@jsquyres
Copy link
Member

@bharatpotnuri @larrystevenwise Any update?

@larrystevenwise
Copy link

@bharatpotnuri is out until tomorrow.

@jsquyres jsquyres added this to the v2.0.2 milestone Aug 16, 2016
@bharatpotnuri
Copy link
Contributor Author

Commit 9a08fb7 should fix the build failure seen with IBM-CI(XL Compiler).

@jsquyres
Copy link
Member

Might as well squash those two commits together (i.e., there's not much use in having a PR with an incorrect commit followed by another commit to fix that wrong commit).

Have you tested whether the RDMACM CPC is actually thread safe, or are you just disabling it?

@bharatpotnuri
Copy link
Contributor Author

@jsquyres Yes agreed, I could not find a proper way to update older PR. I found this method googling. will correct it next time.
Yes I tested it, there are issues with rdmacm for multi thread tests (#1848 (comment)). This should be a temporary change for release. We may need to keep the issue open until its fixed.

@jsquyres
Copy link
Member

@bharatpotnuri You can git rebase to squash the two commits, and then force push to your branch (this is just about the only acceptable time to do a force push!). A Github PR always shows the current status of your source branch, so if you change the state of that branch, the PR will automatically update to show that.

Per "yes, there are threading problems": cool. I just added an update to #1841 to confirm / cross-reference that data point.

@jsquyres
Copy link
Member

@bharatpotnuri Are you going to squash these 2 commits down into one commit?

The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
@bharatpotnuri
Copy link
Contributor Author

@jsquyres Yes, squashed.
Thanks.

@jsquyres
Copy link
Member

Looks like a temporary network failure at Mellanox. Try again...

bot:mellanox:retest

@jsquyres jsquyres removed this from the v2.0.2 milestone Sep 27, 2016
@jsquyres jsquyres merged commit 1a5a5fb into open-mpi:master Sep 27, 2016
@jsquyres
Copy link
Member

jsquyres commented Sep 27, 2016

@larrystevenwise and/or @bharatpotnuri Can you please create two PRs; one each that cherry-picks this commit to:

  • v2.x branch
  • v2.0.x branch

We're feature-closing the v2.1.0 release (i.e., the v2.x branch) this Friday -- please create the PRs by then. Thanks!

@bharatpotnuri
Copy link
Contributor Author

Created PRs #2133 and #2134 against v2.x and v2.0.x branches respectively.
Thanks.

@jsquyres
Copy link
Member

@bharatpotnuri Thank you! Please don't forget to a) get them reviewed, and b) mark them with appropriate labels (e.g., bug) and the appropriate milestone (if the milestone is not set, we don't see them in our bug scrub / PR reviews).

@bharatpotnuri
Copy link
Contributor Author

@jsquyres Who in this case is required to review these changes? :)

@jsquyres
Copy link
Member

@bharatpotnuri Anyone. Including @larrystevenwise. 😄 We trust your discretion here -- we're a co-operative community, after all. The intent is that you get a reasonable code review. If you'd like a co-worker to do it, great. If you'd like someone from the community to do it, that's fine too. Basically: it's the PR author's responsibility to get the review done, but the PR author should feel free to reach out to the community if needed.

Make sense?

@bharatpotnuri
Copy link
Contributor Author

@jsquyres Yes.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants