Processes return wrong results on OpenMPI 4.0.4 unless $OMPI_MCA_pml is set to "^ucx" #8321

Closed
ksiazekm opened this issue Dec 27, 2020 · 57 comments · Fixed by #8511

@ksiazekm

OpenMPI 4.0.4
Installed from the official Fedora repository using dnf.
OS: Fedora 33, the latest
gcc 10.2.1 20201125 (Red Hat 10.2.1-9)
AMD FX8350

Some processes randomly return wrong results on the given configuration. When I
tested the same source code on Debian GNU/Linux 7 (wheezy) with OpenMPI 1.4.5,
it returned correct results every time. The test on each configuration was executed
30 times in a loop. OpenMPI 4.0.4 started to work properly after setting
$OMPI_MCA_pml to "^ucx".

The source code: gist link
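
For reference, the workaround was applied roughly like this (a sketch only; the binary name and the table-size argument are illustrative, matching the reproduction command given later in this thread):

export OMPI_MCA_pml=^ucx
mpirun --oversubscribe -np 4 ./a.out 120000000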

@hppritcha
Member

It would be useful to know which version of UCX is being used.

@ksiazekm
Author

ksiazekm commented Jan 5, 2021

It would be useful to know which version of UCX is being used.

ucx.x86_64 1.8.1-3.fc33

@jsquyres
Member

jsquyres commented Jan 5, 2021

Can you provide more detail about exactly what is going wrong, and/or a small example showing the problem?

@jsquyres
Member

jsquyres commented Jan 5, 2021

I'm the worst -- you provided a gist link with a sample and I missed it. We'll investigate...

@hppritcha hppritcha self-assigned this Jan 5, 2021
@hppritcha
Member

@ksiazekm I'm having trouble reproducing this. First, how many MPI processes are you using when you observe the problem, and what tabsize are you supplying?

@ksiazekm
Author

ksiazekm commented Jan 9, 2021

Here are 30 executions of the code from the previous gist: Test Results

@ggouaillardet
Contributor

@hppritcha FWIW, I was able to reproduce the issue on a Fedora 33 virtual machine, using Open MPI and UCX provided by the distro. The only workaround I could find was discarding the pml/ucx component.
I did not have time to dig deeper and figure out whether the root cause is in Open MPI, UCX, or both.

@awlauria
Contributor

So far I can't reproduce this on POWER9 + RHEL 8.2. I tried the latest v4.0.x and also the v4.0.4 release tarball.

@hoopoepg
Contributor

@hppritcha FWIW, I was able to reproduce the issue on a Fedora 33 virtual machine, using Open MPI and UCX provided by the distro

Hi,
do you have any specific configuration in the virtual machine? What processor does the host system have?
We are trying to reproduce the issue but have had no success.

@ggouaillardet
Contributor

That is a pretty standard VirtualBox VM running on an x86_64 (Intel) processor.

@hoopoepg
Contributor

How many CPUs, and how much memory?
Could you also run ucx_info -d there and post the output here?

Thank you

@ggouaillardet
Contributor

1 GB of memory
1 CPU

#
# Memory domain: sockcm
#     Component: sockcm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: tcp
#      Device: enp0s3
#
#      capabilities:
#            bandwidth: 113.16/ppn + 0.00 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Connection manager: tcp
#      max_conn_priv: 2040 bytes
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: self
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#   Transport: sysv
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: posix
#     Component: posix
#             allocate: unlimited
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#   Transport: posix
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#

@ggouaillardet
Contributor

Use

mpirun --oversubscribe -np 4 a.out 120000000

You might have to run it several times to trigger the issue.
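
For example, a simple way to repeat the run 30 times, as in the original report (a sketch):

for i in $(seq 30); do
    mpirun --oversubscribe -np 4 a.out 120000000
done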

@jsquyres jsquyres added this to the v4.0.6 milestone Jan 12, 2021
@jsquyres
Member

I'm pre-emptively adding a v4.1 label on this issue too -- I'm assuming that the issue exists over there as well.

@gpaulsen
Member

I set the blocker label until we have more information.

@jsquyres jsquyres added bug and removed question labels Jan 12, 2021
@ggouaillardet
Contributor

@jsquyres @gpaulsen I was only able to reproduce this on a Fedora 33 (virtual) machine, and the workaround is to discard the pml/ucx component (built on top of the distro-provided UCX).

I did not investigate further (lack of time) and could not reach a conclusion w.r.t. the root cause of this issue (Open MPI? UCX? Fedora? a combination of these?)

@opoplawski meanwhile, you might want to skip the pml/ucx component in the Fedora package by default. That can be achieved by adding a

pml = ^ucx

line into /etc/openmpi-x86_64/openmpi-mca-params.conf
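
For a single run, the same component can also be skipped on the command line (equivalent to the $OMPI_MCA_pml environment variable from the original report), e.g.:

mpirun --mca pml '^ucx' --oversubscribe -np 4 a.out 120000000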

@jsquyres jsquyres assigned hoopoepg and hppritcha and unassigned hppritcha Jan 13, 2021
@jsquyres
Member

@jladd-mlnx @hoopoepg Someone from the UCX community needs to look into this.

@hoopoepg
Contributor

We can't reproduce the issue in our environment :(

@opoplawski
Contributor

opoplawski commented Jan 28, 2021

I'm happy to push a ucx 1.9 update to Fedora 33 if that seems appropriate. Is updating UCX versions okay? There isn't a soname bump.

@hppritcha
Member

@opoplawski thanks Orion. Transferring to Nvidia to let them decide how to deal with this from the Open MPI side. I recommend a configure check to not build the UCX pml and osc components if the installed UCX is older than 1.9. @hoopoepg

@hppritcha hppritcha removed their assignment Jan 29, 2021
@jsquyres
Member

@hppritcha Better to use the general "UCX" team name to alert all the relevant people on the UCX side, not just @hoopoepg.

FYI: @open-mpi/ucx

@yosefe
Contributor

yosefe commented Jan 29, 2021

@hppritcha @jsquyres how about pml/ucx disqualifying itself at runtime if the UCX library version is older than v1.9, with a warning and an MCA variable to override this?
I guess we could update that minimal version from time to time as we move to newer UCX versions.
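
For illustration only (not an actual patch), a minimal sketch of such a runtime guard using the UCP version query; the helper name, the v1.9 threshold, and the message are placeholders taken from this discussion:

#include <stdio.h>
#include <ucp/api/ucp.h>

/* Hypothetical helper: returns 1 if the UCX library loaded at run time is at
 * least req_major.req_minor. */
static int ucx_runtime_at_least(unsigned req_major, unsigned req_minor)
{
    unsigned major, minor, release;
    ucp_get_version(&major, &minor, &release);
    return (major > req_major) || (major == req_major && minor >= req_minor);
}

int main(void)
{
    if (!ucx_runtime_at_least(1, 9)) {
        /* This is where pml/ucx would disqualify itself and fall back to
         * another pml, unless a (hypothetical) override MCA variable is set. */
        fprintf(stderr, "UCX runtime %s is older than v1.9; known data corruption issue\n",
                ucp_get_version_string());
        return 1;
    }
    printf("UCX runtime %s is new enough\n", ucp_get_version_string());
    return 0;
}

Compile with something like gcc check_ucx.c -lucp.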

@jsquyres
Member

If you really want to stop supporting older versions of UCX, you can certainly do that.

IMHO: It would be best to have both configure.m4-time and run time checks for versions, just to be safe.

@opoplawski
Contributor

I've submitted ucx 1.9 to Fedora 33 - https://bodhi.fedoraproject.org/updates/FEDORA-2021-613166cadb Please test and provide feedback.

@jsquyres
Member

jsquyres commented Feb 1, 2021

@ksiazekm Just out of curiosity, are you seeing the same behavior with Open MPI v4.1.0 as well?

@jsquyres
Member

jsquyres commented Feb 8, 2021

@open-mpi/ucx Can you give a definitive ruling on what you plan to do? I.e., are UCX versions prior to v1.9.0 problematic?

@hoopoepg
Contributor

hoopoepg commented Feb 9, 2021

There is one more issue, #8442, which may be related to this one, and we have reproduced it in our environment. We are working on identifying the root cause.

@hoopoepg
Contributor

hoopoepg commented Feb 9, 2021

@ksiazekm could you test the latest OMPI (4.0.5) + UCX (1.9) available in Fedora 33?

thank you

@ksiazekm
Author

ksiazekm commented Feb 9, 2021

I can confirm that it works properly with:

  • ucx.x86_64 1.9.0-1.fc33
  • openmpi.x86_64 4.0.5-1.fc33

@hppritcha
Member

@open-mpi/ucx just to note, there seem to be some significant differences between this issue and #8442: this one doesn't seem to be reproducible when using the RC UCX_TLS, whereas #8442 seems to be avoidable using sm,tcp. They may still be related, although the conditions under which they appear w.r.t. UCX_TLS are different.

@hoopoepg
Contributor

@ksiazekm could you provide the output of the command ucx_info -d from the system where the data corruption happened?

thank you

@hoopoepg
Contributor

@open-mpi/ucx Can you give a definitive ruling on what you plan to do? I.e., are UCX versions prior to v1.9.0 problematic?

Hi @jsquyres
UCX v1.8 has a data race in some configurations when the TCP and SHM transports are used simultaneously. In UCX v1.9 this data flow was updated and the issue doesn't appear anymore. The final fix was implemented in UCX v1.10 in PR openucx/ucx#5936.

Is it still important enough to add a warning, given that we are pushing an updated package for Fedora?
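
As a side note (not discussed in this thread, and untested): since the race involves mixing the TCP and SHM transports, one possible stop-gap for users who cannot upgrade yet might be to pin UCX to a single transport family via UCX_TLS; ucx_info -v shows which UCX version is actually in use. A sketch:

ucx_info -v
mpirun -x UCX_TLS=tcp,self --oversubscribe -np 4 a.out 120000000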

@jsquyres
Member

Many people use Open MPI outside of Fedora.

If you want to restrict Open MPI to only use UCX >= v1.10, you need to update the configury.

Silent data corruption is BAD.

@yosefe
Contributor

yosefe commented Feb 16, 2021

@jsquyres IMO, a runtime version check would be stricter (in case the user compiles Open MPI with one UCX version but runs with another), WDYT?

@jsquyres
Member

That would be fine as well.

But we definitely also like configure-time failures (with helpful explanation messages). E.g., if someone tries to compile with an old/unsupported UCX, it should fail right away during configure (vs. succeeding to configure, build, and install, and then only fail at run time).

Make sense?
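
For illustration, the kind of tiny test program a configure-time probe could try to compile (a sketch only; it assumes the version macros from ucp/api/ucp_version.h, and the real Open MPI configury may check something different):

/* Fails to compile against UCX headers older than v1.9. */
#include <ucp/api/ucp_version.h>

#if UCP_API_VERSION < UCP_VERSION(1, 9)
#error "UCX headers are older than v1.9 -- known silent data corruption with tcp+shm"
#endif

int main(void)
{
    return 0;
}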

@yosefe
Contributor

yosefe commented Feb 16, 2021

@jsquyres yes, that makes sense. Just to make sure: dropping support for older UCX versions would not be a problem for Open MPI?

@yosefe
Contributor

yosefe commented Feb 16, 2021

adding @shamisp as well

@ksiazekm
Author

ksiazekm commented Feb 16, 2021

@hoopoepg here it is:

$ ucx_info -d
#
# Memory domain: sockcm
#     Component: sockcm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: tcp
#      Device: enp3s6
#
#      capabilities:
#            bandwidth: 11.32/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Connection manager: tcp
#      max_conn_priv: 2040 bytes
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: self
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#   Transport: sysv
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: posix
#     Component: posix
#             allocate: unlimited
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#   Transport: posix
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#

@hoopoepg
Contributor

@ksiazekm thank you for the info.
You have the same configuration in which we reproduced the issue, and upgrading to UCX 1.9 should resolve it.

@ksiazekm
Author

ksiazekm commented Feb 17, 2021

@hoopoepg indeed, as I wrote in the comment :)

@jsquyres
Member

@jsquyres yes, that makes sense. Just to make sure: dropping support for older UCX versions would not be a problem for Open MPI?

Please fix #8489, and then if you strand your customers with older versions of UCX, I don't really care.

@jsquyres
Member

@open-mpi/ucx What version of UCX is known to be good -- is it v1.9.0, or the upcoming v1.10.0? (I ask because I see that 1.9.0 is still the current release on https://github.com/openucx/ucx).

Given that this is silent data corruption, and per the discussion above, it sounds like we need both a configure-time check and a run-time check to ensure that the UCX PML is running with >= UCX v1.GOOD_VERSION. Is there any progress on this? We need this ASAP to get new Open MPI releases out the door.

@shamisp
Contributor

shamisp commented Feb 23, 2021

@jsquyres , FYI we documented this critical issue on the front page https://github.com/openucx/ucx#known-critical-issues
