UCX seemingly causing Python MPI tests to fail with Open MPI v4.0.5 #6777

Open
jsquyres opened this issue Jun 26, 2019 · 23 comments

Comments

@jsquyres
Member

jsquyres commented Jun 26, 2019

As reported on the mpi4py bitbucket, it looks like enabling UCX support in Open MPI v4.0.1 in Fedora 30 is causing some mpi4py tests to fail.

See the link above for more details, but the short version is:

  • Preliminary indications are that this is a bug in Open MPI and/or UCX, not mpi4py.
  • The mpi4py tests in question fail with openmpi-4.0.1-1.fc31.x86_64, but succeed with openmpi-4.0.1-5.fc31.x86_64.
  • @opoplawski, who is the Open MPI maintainer for Fedora, confirmed that:
    1. He disabled UCX in the -5 version, which enabled the tests to work.
    2. The issue was supposedly fixed in UCX v1.5.2, so Orion re-enabled UCX support in Open MPI in -6, and the tests broke again.
  • It looks like the corresponding MPI calls are writing past the buffer end for datatypes MPI_SIGNED_CHAR and MPI_SHORT.
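To make the last bullet concrete, here is a minimal sketch of how a write past the end of a 1-byte or 2-byte element can be detected with guard bytes. The 4-byte store is a hypothetical stand-in for a transport that only implements wider updates. The thread does not confirm that this is the actual mechanism, and every name below is illustrative.

```python
import struct

CANARY = 0xAB  # sentinel value placed after the payload


def make_guarded(nbytes, guard=8):
    """Zeroed payload of `nbytes` followed by `guard` canary bytes."""
    return bytearray(nbytes) + bytearray([CANARY] * guard)


def overrun_bytes(buf, nbytes):
    """Count guard bytes after the payload that no longer hold CANARY."""
    return sum(1 for b in buf[nbytes:] if b != CANARY)


def store_as_4_bytes(buf, offset, value):
    """Hypothetical transport that always writes a 4-byte little-endian int."""
    struct.pack_into("<i", buf, offset, value)


char_buf = make_guarded(1)    # one MPI_SIGNED_CHAR-sized element
store_as_4_bytes(char_buf, 0, 1)
clobbered_char = overrun_bytes(char_buf, 1)

short_buf = make_guarded(2)   # one MPI_SHORT-sized element
store_as_4_bytes(short_buf, 0, 1)
clobbered_short = overrun_bytes(short_buf, 2)

print(clobbered_char, clobbered_short)  # prints "3 2"
```

A guard region like this is one common way for a test to notice that a call wrote more bytes than the datatype size allows.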

@jladd-mlnx @artpol84 Can someone from Mellanox look into this?

FYI: @dalcinl @opoplawski

@artpol84
Contributor

artpol84 commented Jul 1, 2019

@jsquyres

openmpi-4.0.1-1.fc31.x86_64 seems to be older than openmpi-4.0.1-5.fc31.x86_64. Can this indicate that the problem was fixed?

@artpol84
Contributor

artpol84 commented Jul 1, 2019

FYI: @yosefe

@jsquyres
Member Author

jsquyres commented Jul 1, 2019

@artpol84 No, please re-read my summary and/or the entire thread.

@artpol84
Contributor

artpol84 commented Jul 9, 2019

> He disabled UCX in the -5 version, which enabled the tests to work.

> @artpol84 No, please re-read my summary and/or the entire thread.

@jsquyres, thanks. I missed it.

@jsquyres
Member Author

@artpol84 Can Mellanox check to see if this is now fixed with Open MPI v4.0.2?

@artpol84
Contributor

@jsquyres I don't expect it to be fixed as of now. UCX doesn't support 1B and 2B atomics.
We are planning to fix it in the near future, but it is not yet fixed.
@janjust, @jladd-mlnx, please correct me if I am wrong.

@opoplawski
Contributor

I suspect that this was resolved at some point. At least, mpi4py test_rma is no longer failing on Fedora Rawhide.

@jsquyres
Member Author

@artpol84 Since you're planning to support it in the near future, let's leave this open to track it.

@hppritcha
Member

@artpol84 does UCX support 1 and 2 byte atomics now?

@yosefe
Contributor

yosefe commented Mar 27, 2021

> @artpol84 does UCX support 1 and 2 byte atomics now?

It does not

@hppritcha
Member

closing as no longer being observed by @opoplawski

@dalcinl
Contributor

dalcinl commented Mar 30, 2021

@hppritcha I believe the issues are not fully resolved yet.
I'm running Fedora 33, with openmpi-4.0.5-1.fc33.x86_64.
The following test run is with mpi4py/master.

$ mpiexec -n 1 python test/runtests.py --no-threads -v -i rma$ TestRMASelf
[0@optiplex] Python 3.9 (/usr/bin/python)
[0@optiplex] MPI 3.1 (Open MPI 4.0.5)
[0@optiplex] mpi4py 3.1.0a0 (/home/dalcinl/Devel/mpi4py-dev/build/lib.linux-x86_64-3.9/mpi4py)
testAccumulate (test_rma.TestRMASelf) ... ok
testAccumulateProcNullReplace (test_rma.TestRMASelf) ... ok
testAccumulateProcNullSum (test_rma.TestRMASelf) ... ok
testCompareAndSwap (test_rma.TestRMASelf) ... [1617082714.741125] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741161] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741172] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741179] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741203] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741211] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741216] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741362] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741378] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741388] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741415] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741428] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741437] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741593] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741606] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741614] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741646] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741656] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741678] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741849] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741882] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741895] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741936] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741951] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741964] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
ok
testFence (test_rma.TestRMASelf) ... ok
testFenceAll (test_rma.TestRMASelf) ... ok
testFetchAndOp (test_rma.TestRMASelf) ... [1617082714.742394] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742406] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742412] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742443] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742449] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742453] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742459] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742464] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742468] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742507] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742514] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742533] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742562] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742568] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742572] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742594] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742600] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742606] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742966] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742982] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742992] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743084] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743109] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743119] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743128] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743137] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743158] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743201] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743225] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743234] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743312] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743322] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743331] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743341] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743351] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743360] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743815] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743826] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743834] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743873] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743880] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743888] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743895] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743902] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743910] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743965] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743977] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743985] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744054] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744062] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744069] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744076] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744082] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744089] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744165] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744174] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744182] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744239] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744247] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744255] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744278] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744286] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744293] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744805] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744837] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744849] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744919] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744930] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744941] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744953] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744964] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744975] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.745031] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745045] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745056] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745141] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745152] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745163] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745175] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745186] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745197] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745815] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745827] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745836] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745887] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745896] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745905] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745914] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745923] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745931] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
ok
testFlush (test_rma.TestRMASelf) ... ok
testGetAccumulate (test_rma.TestRMASelf) ... ok
testGetAccumulateProcNull (test_rma.TestRMASelf) ... ok
testGetProcNull (test_rma.TestRMASelf) ... ok
testPostWait (test_rma.TestRMASelf) ... ok
testPutGet (test_rma.TestRMASelf) ... ok
testPutProcNull (test_rma.TestRMASelf) ... ok
testStartComplete (test_rma.TestRMASelf) ... ok
testStartCompletePostTest (test_rma.TestRMASelf) ... ok
testStartCompletePostWait (test_rma.TestRMASelf) ... ok
testSync (test_rma.TestRMASelf) ... ok

----------------------------------------------------------------------
Ran 18 tests in 0.089s

OK

@yosefe
Contributor

yosefe commented Mar 30, 2021

It seems MPI is trying to use 1-byte and 2-byte atomics, which are not supported by UCX.

@hoopoepg
Contributor

and 16 byte datatypes

@dalcinl
Contributor

dalcinl commented Mar 30, 2021

Yes, I have a test that loops over datatypes and performs CompareAndSwap and FetchAndOp. Isn't that a reasonable test? IMHO, if an MPI implementation cannot support the operation for some datatypes, it should barf with an error.
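A rough sketch of the kind of loop described above, not the actual mpi4py test: it issues one Compare_and_swap and one Fetch_and_op per element type on a window over `MPI.COMM_SELF`. The function name, the choice of typecodes, and the 64-byte window are assumptions made for illustration. mpi4py is imported inside the function because it needs a working MPI installation.

```python
from array import array

# Typecodes for array.array: 'b' and 'h' are the 1- and 2-byte cases
# (MPI_SIGNED_CHAR and MPI_SHORT); 'i' and 'q' are typically 4 and 8 bytes.
TYPECODES = ("b", "h", "i", "q")


def run_rma_atomics():
    """Issue Compare_and_swap and Fetch_and_op once per typecode."""
    from mpi4py import MPI  # lazy import: requires an MPI installation

    memory = bytearray(64)  # zero-initialised window memory
    win = MPI.Win.Create(memory, 1, MPI.INFO_NULL, MPI.COMM_SELF)
    try:
        for tc in TYPECODES:
            origin = array(tc, [1])
            compare = array(tc, [0])
            result = array(tc, [-1])

            # Swap in `origin` if the target equals `compare`.
            win.Lock(0)
            win.Compare_and_swap(origin, compare, result, 0, 0)
            win.Unlock(0)

            # Add `origin` to the target and fetch the previous value.
            win.Lock(0)
            win.Fetch_and_op(origin, result, 0, 0, MPI.SUM)
            win.Unlock(0)
    finally:
        win.Free()
```

Calling `run_rma_atomics()` from a script started with `mpiexec -n 1` should exercise the same element sizes that show up in the log above.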

@yosefe
Contributor

yosefe commented Mar 30, 2021

> and 16 byte datatypes

datatype=16 is 2-byte contig
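For readers decoding the log, a small sketch of how the logged numbers map to element sizes. It assumes the value is a UCP contiguous datatype identifier in which the low 3 bits hold the datatype class (0 for contiguous) and the remaining bits hold the element size in bytes. The constant and function names below mirror that assumption and are not taken from this thread.

```python
UCP_DATATYPE_SHIFT = 3            # assumed: low 3 bits hold the datatype class
UCP_DATATYPE_CLASS_MASK = 0b111
UCP_DATATYPE_CONTIG = 0           # assumed class value for contiguous types


def contig_elem_size(datatype_id):
    """Return the element size in bytes encoded in a contiguous datatype id."""
    if datatype_id & UCP_DATATYPE_CLASS_MASK != UCP_DATATYPE_CONTIG:
        raise ValueError(f"not a contiguous datatype id: {datatype_id}")
    return datatype_id >> UCP_DATATYPE_SHIFT


sizes = {dt: contig_elem_size(dt) for dt in (8, 16, 128)}
print(sizes)  # {8: 1, 16: 2, 128: 16}
```

Under that assumption, the three values in the error messages correspond to 1-byte, 2-byte, and 16-byte elements.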

@dalcinl
Contributor

dalcinl commented Mar 30, 2021

> datatype=16 is 2-byte contig

And datatype=128 is 16-byte contig, right?

@yosefe
Contributor

yosefe commented Mar 30, 2021

> And datatype=128 is 16-byte contig, right?

right
osc/ucx should fall back to active messages if the datatype is not 4 or 8 bytes.
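The suggested behaviour can be sketched as a size check in front of the atomic call. This is only an illustration of the decision, not osc/ucx code; the function name and the returned labels are hypothetical.

```python
NATIVE_ATOMIC_SIZES = frozenset({4, 8})  # sizes the thread says UCX supports


def choose_atomic_path(elem_size):
    """Pick a path for an atomic operation on an element of `elem_size` bytes."""
    if elem_size <= 0:
        raise ValueError("element size must be positive")
    if elem_size in NATIVE_ATOMIC_SIZES:
        return "native_atomic"
    return "active_message"  # fallback for 1-, 2-, 16-byte elements, etc.


paths = {size: choose_atomic_path(size) for size in (1, 2, 4, 8, 16)}
print(paths)
```

With this check, only the 4-byte and 8-byte cases would reach the native atomic call, so the "invalid atomic operation datatype" errors would not be triggered for the other sizes.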

@hppritcha reopened this Mar 30, 2021

@gpaulsen changed the title from "UCX seemingly causing Python MPI tests to fail with Open MPI v4.0.1" to "UCX seemingly causing Python MPI tests to fail with Open MPI v4.0.5" on Apr 23, 2021

@gpaulsen
Member

@dalcinl What version of UCX are you using with Fedora 33?

@dalcinl
Contributor

dalcinl commented Apr 24, 2021

@gpaulsen These are the current openmpi and ucx packages in my Fedora 33:

$ rpm -qa | egrep "(ucx|openmpi)"
openmpi-4.0.5-1.fc33.x86_64
openmpi-devel-4.0.5-1.fc33.x86_64
ucx-1.9.0-1.fc33.x86_64

@hjelmn
Member

hjelmn commented Apr 24, 2021

Looking at that log, osc/ucx should not be in use. It should be losing to osc/rdma when not using a Mellanox HCA. In the failure case NP=1, which I don't think should ever be using UCX.

This doesn't address the issue that osc/ucx is doing the wrong thing (it is), but it does indicate that this version of Open MPI is selecting the wrong components by default.
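One way to check whether component selection is involved is to exclude osc/ucx and rerun the test. The sketch below relies on Open MPI's convention of reading MCA parameters from `OMPI_MCA_<name>` environment variables, with a leading `^` meaning "exclude". The helper name is hypothetical, and whether this changes the outcome on a given system is not established in this thread.

```python
import os


def exclude_components(framework, names, env=None):
    """Set OMPI_MCA_<framework> so the named components are excluded."""
    env = os.environ if env is None else env
    key = f"OMPI_MCA_{framework}"
    env[key] = "^" + ",".join(names)
    return key, env[key]


# Demonstrate on a plain dict so the real environment is left untouched.
settings = {}
exclude_components("osc", ["ucx"], env=settings)
print(settings)  # {'OMPI_MCA_osc': '^ucx'}
```

To affect a real run, the call would have to happen on `os.environ` before `from mpi4py import MPI`, since the variables are read when MPI is initialized.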

@dalcinl
Contributor

dalcinl commented Apr 25, 2021

@hjelmn mpi4py initializes MPI with THREAD_MULTIPLE. Perhaps that is affecting component selection?
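To test that hypothesis, the requested thread level can be lowered through `mpi4py.rc` before MPI is first imported. A minimal sketch, with a hypothetical wrapper name; it says nothing about whether the thread level actually influences component selection.

```python
VALID_LEVELS = ("single", "funneled", "serialized", "multiple")


def import_mpi(thread_level="single"):
    """Import mpi4py.MPI after requesting the given thread support level."""
    if thread_level not in VALID_LEVELS:
        raise ValueError(f"unknown thread level: {thread_level!r}")

    import mpi4py  # lazy import: requires mpi4py and an MPI installation

    # Must be set before the first `from mpi4py import MPI`.
    mpi4py.rc.thread_level = thread_level
    from mpi4py import MPI

    return MPI
```

After `MPI = import_mpi("single")`, comparing `MPI.Query_thread()` with `MPI.THREAD_SINGLE` shows which level was actually provided.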

@thangckt

I get the error "Caught signal 11 (Segmentation fault: address not mapped to object at address" when I run Python code using Open MPI with UCX.
When I disable UCX, the code runs without any error.

Does anyone know why? Or is there anything I can try to avoid this error?
