UCX seemingly causing Python MPI tests to fail with Open MPI v4.0.5 #6777
Comments
FYI: @yosefe
@artpol84 No, please re-read my summary and/or the entire thread.
@artpol84 Can Mellanox check to see if this is now fixed with Open MPI v4.0.2?
@jsquyres I don't expect it to be fixed as of now. UCX doesn't support 1-byte and 2-byte atomics.
I suspect that this was resolved at some point. At least, mpi4py test_rma is no longer failing on Fedora Rawhide.
@artpol84 Since you're planning to support it in the near future, let's leave this open to track it.
@artpol84 does UCX support 1- and 2-byte atomics now?
It does not.
Closing as no longer being observed by @opoplawski.
@hppritcha I believe the issues are not fully resolved yet.
It seems MPI is trying to use 1-byte and 2-byte atomics, which are not supported by UCX.
And 16-byte datatypes.
Yes, I have a test that loops over datatypes and performs CompareAndSwap and FetchAndOp. Isn't that a reasonable test? IMHO, if an MPI implementation cannot support the operation for some datatypes, it should barf with an error.
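A minimal sketch of what such a loop could look like with mpi4py (the datatype table, window setup, and buffer handling below are my own assumptions for illustration, not the actual test code):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# (MPI datatype, matching numpy dtype) pairs; the 1- and 2-byte entries are the
# ones reported to fail when osc/ucx is selected.
datatypes = [
    (MPI.SIGNED_CHAR, np.int8),   # 1 byte
    (MPI.SHORT,       np.int16),  # 2 bytes
    (MPI.INT,         np.int32),  # 4 bytes
    (MPI.LONG_LONG,   np.int64),  # 8 bytes
]

for dt, npdt in datatypes:
    itemsize = dt.Get_size()
    win = MPI.Win.Allocate(itemsize, itemsize, MPI.INFO_NULL, comm)

    # Zero the local window memory before anyone accesses it remotely.
    mem = np.frombuffer(win.tomemory(), dtype=npdt)
    mem[:] = 0
    comm.Barrier()

    origin  = np.ones(1, dtype=npdt)
    compare = np.zeros(1, dtype=npdt)
    result  = np.zeros(1, dtype=npdt)

    target = (rank + 1) % nprocs
    win.Lock(target)
    # These two calls are where the one-sided component needs 1- and 2-byte
    # atomic support for the small datatypes above.
    win.Compare_and_swap([origin, dt], [compare, dt], [result, dt], target)
    win.Fetch_and_op([origin, dt], [result, dt], target, op=MPI.SUM)
    win.Unlock(target)

    win.Free()
```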
datatype=16 is 2-byte contig
And datatype=128 is 16-byte contig, right?
Right.
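For reference, one hypothetical way to build such 2-byte and 16-byte contiguous elements explicitly in mpi4py (purely illustrative; the handles in the log above are the implementation's internal descriptions and are not necessarily constructed this way):

```python
from mpi4py import MPI

# Contiguous derived types over MPI.BYTE: a 2-byte and a 16-byte element.
contig2  = MPI.BYTE.Create_contiguous(2).Commit()
contig16 = MPI.BYTE.Create_contiguous(16).Commit()

print(contig2.Get_size(), contig16.Get_size())   # -> 2 16

contig2.Free()
contig16.Free()
```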
@dalcinl What version of UCX are you using with Fedora 33?
@gpaulsen These are the current openmpi and ucx packages in my Fedora 33:
Looking at that log, osc/ucx should not be in use. It should be losing to osc/rdma when not using a Mellanox HCA. In the failure case NP=1, which I don't think should be using UCX at all. This doesn't address the issue that osc/ucx is doing the wrong thing (it is), but it does indicate that this version of Open MPI is using the wrong components by default.
@hjelmn mpi4py initializes MPI with THREAD_MULTIPLE. Perhaps that is affecting component selection?
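One way to test that hypothesis is to override Open MPI's one-sided component selection before MPI initializes. A sketch, assuming the OMPI_MCA_* environment-variable form of MCA parameters (equivalent to passing `--mca osc ^ucx` to mpirun):

```python
import os

# Exclude the UCX one-sided component; Open MPI reads OMPI_MCA_* variables at
# MPI_Init time, so this must be set before mpi4py initializes MPI.
os.environ.setdefault("OMPI_MCA_osc", "^ucx")
# Alternatively, force a specific component, e.g. os.environ["OMPI_MCA_osc"] = "rdma"

from mpi4py import MPI  # MPI_Init runs on import, after the override above

if MPI.COMM_WORLD.Get_rank() == 0:
    # Report the provided thread level, since THREAD_MULTIPLE may also
    # influence which components are selected.
    print("thread level:", MPI.Query_thread())
```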
I have a problem, "Caught signal 11 (Segmentation fault: address not mapped to object at address", when running Python code with Open MPI and UCX. Does anyone know why? Or any hint on what I can try to avoid this error?
As reported on the mpi4py bitbucket, it looks like enabling UCX support in Open MPI v4.0.1 in Fedora 30 is causing some mpi4py tests to fail.
See the link above for more details, but the short version is: the failing tests involve MPI_SIGNED_CHAR and MPI_SHORT.
@jladd-mlnx @artpol84 Can someone from Mellanox look into this?
FYI: @dalcinl @opoplawski