btl/openib: Disqualify rdmacm CPC if MPI_THREAD_MULTIPLE #1861
|
@bharatpotnuri GitHub pro tip: If you say "Fixes #1848" in the commit message, it'll auto-close #1848 when this PR gets merged. |
|
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/8c902a00aebe52469163a4742cc2e52a |
|
bot:ibm:retest |
|
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/b21fa0fa4ec5be8f9cc8530cfa2f7488 |
|
This is a valid error from IBM (sorry the first test was hiding the error message): |
|
@jjhursey Looks like the failure is due to |
|
Yes, I agree, working on it. @jsquyres I don't fully understand how these wrappers/libs get compiled and linked in our current problem, and I couldn't find much about the wrappers in the available ompi resources. Could you or somebody else please shed some light on this? |
|
For the XL compiler CI test we have been using |
|
Hence, with |
|
@bharatpotnuri Was the final determination that rdmacm is not thread-safe? Also, you cannot reference ompi_mpi_thread_multiple from opal, since it is defined in libmpi. |
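The layering point above (a lower-level library cannot link against a symbol that lives in the library built on top of it) can be illustrated with a minimal sketch. Every name here (`lower_set_using_threads`, `lower_using_threads`, `upper_init`, the `thread_level` enum) is hypothetical and chosen for illustration only; this is not the code from this PR. The idea is that the lower layer owns the flag, the upper layer sets it once during initialization, and lower-layer code only ever reads its own flag.

```c
#include <assert.h>
#include <stdbool.h>

/* ---- lower layer (hypothetical stand-in for a library such as opal) ---- */

/* The flag lives in the lower layer, so lower-layer code never needs a
 * symbol from the upper layer. */
static bool lower_threads_in_use = false;

void lower_set_using_threads(bool in_use)
{
    lower_threads_in_use = in_use;
}

bool lower_using_threads(void)
{
    return lower_threads_in_use;
}

/* ---- upper layer (hypothetical stand-in for the MPI library) ---- */

enum thread_level {
    THREAD_SINGLE,
    THREAD_FUNNELED,
    THREAD_SERIALIZED,
    THREAD_MULTIPLE
};

/* The upper layer knows the thread level and pushes it down at init time. */
void upper_init(enum thread_level provided)
{
    assert(provided >= THREAD_SINGLE && provided <= THREAD_MULTIPLE);
    lower_set_using_threads(THREAD_MULTIPLE == provided);
}
```

With this arrangement the dependency only points downward: the upper layer calls into the lower one, never the reverse.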
|
@hjelmn We are seeing ompi failures for the rdmacm CPC with MPI_THREAD_MULTIPLE (I am still looking into what exactly the failure is). Since it also needs rigorous testing, we decided to disqualify rdmacm for multithreaded apps for now. |
|
@bharatpotnuri @larrystevenwise Any update? |
|
@bharatpotnuri is out until tomorrow. |
|
Commit 9a08fb7 should fix the build failure seen with the IBM CI (XL compiler). |
|
Might as well squash those two commits together (i.e., there's not much use in having a PR with an incorrect commit followed by another commit to fix that wrong commit). Have you tested whether the RDMACM CPC is actually thread safe, or are you just disabling it? |
|
@jsquyres Yes, agreed. I could not find a proper way to update the older PR, and I found this method by googling. I will correct it next time. |
|
@bharatpotnuri You can git rebase to squash the two commits, and then force push to your branch (this is just about the only acceptable time to do a force push!). A GitHub PR always shows the current status of your source branch, so if you change the state of that branch, the PR will automatically update to show that. Per "yes, there are threading problems": cool. I just added an update to #1841 to confirm / cross-reference that data point. |
|
@bharatpotnuri Are you going to squash these 2 commits down into one commit? |
The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC should disqualify itself (instead of failing in random ways) if MPI_THREAD_MULTIPLE is the thread level. Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
|
@jsquyres Yes, squashed. |
|
Looks like a temporary network failure at Mellanox. Try again... bot:mellanox:retest |
|
@larrystevenwise and/or @bharatpotnuri Can you please create two PRs; one each that cherry-picks this commit to:
We're feature-closing the v2.1.0 release (i.e., the v2.x branch) this Friday -- please create the PRs by then. Thanks! |
|
@bharatpotnuri Thank you! Please don't forget to a) get them reviewed, and b) mark them with appropriate labels (e.g., bug) and the appropriate milestone (if the milestone is not set, we don't see them in our bug scrub / PR reviews). |
|
@jsquyres Who in this case is required to review these changes? :) |
|
@bharatpotnuri Anyone. Including @larrystevenwise. 😄 We trust your discretion here -- we're a co-operative community, after all. The intent is that you get a reasonable code review. If you'd like a co-worker to do it, great. If you'd like someone from the community to do it, that's fine too. Basically: it's the PR author's responsibility to get the review done, but the PR author should feel free to reach out to the community if needed. Make sense? |
|
@jsquyres Yes. |
The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.
Fixes #1848
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
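
The behaviour described in the commit message (a CPC that disqualifies itself rather than failing later) might look roughly like the sketch below. This is an illustration under stated assumptions, not the patch itself: the `cpc` struct, the `CPC_*` return codes, `select_cpc`, and the query functions are all hypothetical, and the second table entry (`udcm`) is only a placeholder for "some other CPC that stays eligible".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for a CPC query. */
enum { CPC_SUCCESS = 0, CPC_ERR_NOT_SUPPORTED = -1 };

/* Each CPC exposes a query function that says whether it can be used. */
typedef int (*cpc_query_fn)(bool thread_multiple);

struct cpc {
    const char  *name;
    cpc_query_fn query;
};

/* The rdmacm-like CPC opts out up front when MPI_THREAD_MULTIPLE is the
 * thread level, instead of failing in random ways later. */
static int rdmacm_query(bool thread_multiple)
{
    return thread_multiple ? CPC_ERR_NOT_SUPPORTED : CPC_SUCCESS;
}

/* Placeholder for a CPC that does not care about the thread level. */
static int udcm_query(bool thread_multiple)
{
    (void)thread_multiple;
    return CPC_SUCCESS;
}

static const struct cpc cpc_table[] = {
    { "rdmacm", rdmacm_query },
    { "udcm",   udcm_query   },
};

/* Return the first CPC whose query succeeds, or NULL if none qualifies. */
const struct cpc *select_cpc(const struct cpc *table, size_t n,
                             bool thread_multiple)
{
    assert(NULL != table);
    for (size_t i = 0; i < n; ++i) {
        if (CPC_SUCCESS == table[i].query(thread_multiple)) {
            return &table[i];
        }
    }
    return NULL;
}
```

In this sketch, `select_cpc(cpc_table, 2, true)` skips the rdmacm entry and returns the next eligible one, while `select_cpc(cpc_table, 2, false)` returns the rdmacm entry.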