Data corruption with OpenMPI 4.0.4 and UCX 1.9.0 #8442

Closed
jayeshkrishna opened this issue Feb 3, 2021 · 49 comments

@jayeshkrishna

jayeshkrishna commented Feb 3, 2021

There appears to be a data corruption issue with OpenMPI 4.0.4 + UCX 1.9.0 (and 4.1.0 + UCX 1.9.0) when communicating data between processes using MPI. The attached test code is a C++ MPI program that shows the issue when run on more than one node of our local cluster. Restricting the UCX transports to sm, tcp, ud works around the issue (leaving UCX_TLS unset, or setting it to rc or dc, results in data corruption). The issue does not show up with the other MPI libraries (Intel MPI) installed on the system. This issue may be related to #8321.

More information on the issue is provided below,

  • The issue shows up only when running the program on more than one node. Running the attached test program with 2 processes spanning 2 nodes reproduces the issue on our local cluster.
  • The test program gathers data for multiple variables (an hvector of indexed variables) to rank 0, using MPI point-to-point communication rather than MPI_Gather; a minimal sketch of this kind of datatype construction follows this list. The number of variables gathered can be controlled with the "--nvars=x" command line option. The corruption is related to the amount of data gathered: it shows up only when gathering many variables (on our cluster it does not show up when gathering <= 10 variables, and it does not show up for every case with > 10 variables either).
  • The workaround is to set the environment variable "UCX_TLS" to "sm,ud" to restrict the UCX transports used by OpenMPI+UCX.
  • I built the latest version of OpenMPI, 4.1.0, on the system and the issue still exists there.
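
A minimal illustrative sketch of this kind of gather is shown below. It is not the actual reproducer from the gist; the element counts, strides, and variable names are assumptions made for the example.

// A minimal illustrative sketch (not the actual reproducer from the gist):
// each rank describes "nvars" variables with an hvector of an indexed
// datatype and sends them to rank 0 with point-to-point calls; rank 0
// validates the received values. Sizes, strides, and names are assumptions.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int nvars  = 13;  // e.g. one of the failing --nvars values
    const int nelems = 4;   // elements selected per variable (assumed)
    const int stride = 2;   // every other element is selected (assumed)

    // Indexed selection for one variable: nelems single doubles, strided.
    std::vector<int> blocklens(nelems, 1), displs(nelems);
    for (int i = 0; i < nelems; ++i) displs[i] = i * stride;
    MPI_Datatype indexed_t, hvec_t;
    MPI_Type_indexed(nelems, blocklens.data(), displs.data(), MPI_DOUBLE, &indexed_t);

    // hvector of the indexed type: one copy per variable, var_bytes apart.
    const MPI_Aint var_bytes = (MPI_Aint)(nelems * stride) * sizeof(double);
    MPI_Type_create_hvector(nvars, 1, var_bytes, indexed_t, &hvec_t);
    MPI_Type_commit(&hvec_t);

    // Send buffer holds the nvars variables back to back, filled with known values.
    std::vector<double> sendbuf(nvars * nelems * stride);
    for (int i = 0; i < (int)sendbuf.size(); ++i) sendbuf[i] = rank * 1000.0 + i;

    const int count = nvars * nelems;  // doubles actually transferred per rank
    int errors = 0;
    if (rank != 0) {
        MPI_Send(sendbuf.data(), 1, hvec_t, 0, 0, MPI_COMM_WORLD);
    } else {
        // Rank 0 gathers from every other rank (its own data is skipped for brevity).
        std::vector<double> recvbuf(count);
        for (int src = 1; src < nprocs; ++src) {
            MPI_Recv(recvbuf.data(), count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            // Element i of variable v came from send offset v*nelems*stride + i*stride.
            for (int v = 0; v < nvars; ++v)
                for (int i = 0; i < nelems; ++i)
                    if (recvbuf[v * nelems + i] !=
                        src * 1000.0 + v * nelems * stride + i * stride)
                        ++errors;
        }
        if (errors == 0) std::printf("SUCCESS: Validation successful\n");
        else             std::printf("ERROR: Validation failed (%d mismatches)\n", errors);
    }

    MPI_Type_free(&hvec_t);
    MPI_Type_free(&indexed_t);
    MPI_Finalize();
    return errors ? 1 : 0;
}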

The OpenMPI and UCX install information and the test program are available at https://gist.github.com/jayeshkrishna/5d053d3d5bba11359ea2dc82c435c3ea. On a successful run the test program prints "SUCCESS: Validation successful". When data validation fails it prints "ERROR: Validation failed" along with the first index in the gather buffer where validation failed.

Building the test program

Note that argparser.[cpp/h] is just helper code; you only need to look at mpi_gather.cpp (the main() and create_snd_rcv_dt() routines).

mpicxx argparser.cpp mpi_gather.cpp -o mpi_gather_openmpi

Running the test program

The issue shows up for certain values of "--nvars" (e.g. --nvars=13) on our local cluster when the processes span multiple nodes, so I use a loop to test it,

for iproc in 2 4 16 64 96 128
do
  for i in 2 4 8 9 10 11 12 13 14 15 16 18 19 20 32
  do
    echo "--------- MPI Gather nprocs = $iproc (nvars = $i) ----------------"
    mpiexec -n $iproc ./mpi_gather_openmpi --nvars=$i
  done
done
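
For reference, the UCX_TLS workaround described above can be applied to the same runs either by exporting the variable in the shell or by passing it through Open MPI's -x option; the command lines below are illustrative.

# Restrict UCX to the shared-memory and UD transports (workaround from the description).
export UCX_TLS=sm,ud
mpiexec -n 2 ./mpi_gather_openmpi --nvars=13

# Or export the variable only for the MPI processes:
mpiexec -x UCX_TLS=sm,ud -n 2 ./mpi_gather_openmpi --nvars=13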

Please let me know if you need further information on this issue.

@yosefe
Contributor

yosefe commented Feb 3, 2021

@jayeshkrishna can you please try setting UCX_MEM_EVENTS=n to check whether the issue is related to memory hooks?

@jayeshkrishna
Author

I just tried setting the environment variable UCX_MEM_EVENTS=n but that did not help (the reproducer validation still failed).

@jsquyres
Member

jsquyres commented Feb 4, 2021

@open-mpi/ucx FYI

jsquyres assigned yosefe and hoopoepg and unassigned hoopoepg Feb 4, 2021
@hoopoepg
Contributor

hoopoepg commented Feb 5, 2021

hi @jayeshkrishna

thank you for the bug report.

I tried to reproduce your issue and it seems our compiler is too old to handle the regex code; I see this error:

--------- MPI Gather nprocs = 2 (nvars = 2) ----------------
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
[jazz17:111574] *** Process received signal ***
[jazz17:111574] Signal: Aborted (6)
[jazz17:111574] Signal code:  (-6)

We are using this compiler: GCC 4.8.5 20150623 (Red Hat 4.8.5-36).

What compiler do you use?
I'm going to re-implement your reproducer to use the getopt function to parse arguments; I will let you know about progress.
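
For illustration, a minimal sketch of such a getopt_long-based replacement for the --nvars option is shown below (the option handling and variable names are assumptions, not the eventual implementation):

// Minimal sketch of parsing --nvars=N with getopt_long instead of <regex>,
// which also works with older compilers such as gcc 4.8.
#include <getopt.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv)
{
    int nvars = 1;
    static struct option long_opts[] = {
        {"nvars", required_argument, NULL, 'n'},
        {NULL, 0, NULL, 0}
    };
    int c;
    while ((c = getopt_long(argc, argv, "n:", long_opts, NULL)) != -1) {
        if (c == 'n') nvars = std::atoi(optarg);  // e.g. "./mpi_gather_openmpi --nvars=13"
    }
    std::printf("nvars = %d\n", nvars);
    return 0;
}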

@sarats

sarats commented Feb 5, 2021

@hoopoepg Jayesh used Intel 20.0.4 (from the gist above).
GNU 4.8.5 is too old; a recent gcc shouldn't have issues as far as I know.

@sarats

sarats commented Feb 5, 2021

@hoopoepg Apparently, gcc 4.9 added "Improved support for C++11, including: support for <regex>;"
https://gcc.gnu.org/gcc-4.9/changes.html

@hoopoepg
Contributor

hoopoepg commented Feb 5, 2021

ok, thank you

@jayeshkrishna
Author

Yes, a newer version of gcc should work fine. We use gcc 8.2.0 for our GNU tests (we require gcc 4.9+ for building Scorpio + E3SM due to the limited C++ regex support in gcc < 4.9).

@hoopoepg
Contributor

hoopoepg commented Feb 8, 2021

hi @jayeshkrishna
could you test your reproducer with the environment variable
UCX_MAX_RNDV_LANES=1

thank you

@jayeshkrishna
Author

Ok, I will try this as soon as our cluster finishes the scheduled maintenance today. Meanwhile, were you able to recreate the issue with the test program?

@gpaulsen
Member

gpaulsen commented Feb 8, 2021

Added the blocker label as this is a data corruption issue.

@jayeshkrishna
Author

Setting the environment variable UCX_TLS=rc, I can recreate the issue on a single node.

On a single-node run with UCX_TLS=rc and UCX_MAX_RNDV_LANES=1 the issue still exists.

The issue also exists in a multi-node run with UCX_MAX_RNDV_LANES=1 (UCX_TLS not set). The combinations tested are sketched below.
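
For clarity, the combinations tested above were along the following lines; the exact command lines are assumptions, and --map-by node is used here only to force the ranks onto different nodes.

# Single node, RC transport only: validation fails.
mpiexec -x UCX_TLS=rc -n 2 ./mpi_gather_openmpi --nvars=13

# Single node, RC transport plus a single rendezvous lane: still fails.
mpiexec -x UCX_TLS=rc -x UCX_MAX_RNDV_LANES=1 -n 2 ./mpi_gather_openmpi --nvars=13

# Two nodes, UCX_TLS unset, single rendezvous lane: still fails.
mpiexec -x UCX_MAX_RNDV_LANES=1 --map-by node -n 2 ./mpi_gather_openmpi --nvars=13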

@jayeshkrishna
Author

Also note that argparser.cpp has a no_regex_parse() function to support compilers without regex support. If you are having issues with regex when running the code, you might want to try replacing

ap.parse(argc, argv);

in mpi_gather.cpp:57 with

ap.no_regex_parse(argc, argv);

(I would still recommend a newer version of the GNU compiler, though.)

@hoopoepg
Contributor

hoopoepg commented Feb 9, 2021

@jayeshkrishna is it possible to reproduce the issue using the OMPI master branch?

@jayeshkrishna
Author

I haven't tried OpenMPI master, but as mentioned in the issue description above, the issue also occurs with OpenMPI 4.1.0 (OpenMPI 4.1.0 + UCX 1.9.0). Are you able to recreate the issue using the reproducer?

@hoopoepg
Contributor

hoopoepg commented Feb 9, 2021

Yes, we reproduced it; it is also reproduced using the shared memory transport...
We found some really strange behavior in the OMPI data pack engine and found that the issue is not reproduced on the OMPI master branch.

@hppritcha
Member

@hoopoepg so you don't see this problem on master?

@hoopoepg
Contributor

hoopoepg commented Feb 9, 2021

@hppritcha right, at least on the configuration described by @jayeshkrishna and on our test configurations used for reproducing.
As far as I can see there are significant changes in datatype processing between 4.1 and master; I tried to backport them to 4.1, but without success...

@hoopoepg
Contributor

hoopoepg commented Feb 9, 2021

@hppritcha update: the issue is reproduced on an extended configuration with the master branch (the run completed a few minutes ago). We will continue to investigate tomorrow.

@sarats

sarats commented Feb 15, 2021

Thanks @hoopoepg for identifying the cause of the problem.

@bosilca Just wondering if there is an interim workaround in case the final fix takes a while.

@jayeshkrishna
Author

When will the fix for this issue be integrated into master and a release? And are there any workarounds that we can use until the fix is available?

@jsquyres
Member

jsquyres commented Mar 9, 2021

@bosilca mentioned on the call today (9 Mar 2021) that he's still looking into this.

@jayeshkrishna
Author

Do you have any estimate of when this bug will be fixed in OpenMPI? We have been unable to switch to OpenMPI because of it (the "workaround" of switching to "sm,ud" avoids the issue for now but significantly slows down the simulations).

@jsquyres
Member

jsquyres commented Mar 16, 2021

There are two parts to this issue:

  1. We discovered that we were accidentally using UCX for a bunch of non-IB/RoCE networks. The next releases of Open MPI (4.1.1 and 4.0.6) will return to using UCX for only IB and RoCE.
  2. There's a subtle issue in the datatype engine.

If you're not using IB or RoCE, you might be able to try the nightly 4.1.x snapshot tarball and see if that fixes your issue (meaning: the datatype issue may not occur if not using the UCX PML).

The datatype issue is still actively being worked on. Believe me, we want this fixed as soon as possible... ☹️

@sarats

sarats commented Mar 16, 2021

Unfortunately, we are using HDR200 InfiniBand, so we will wait for a fix.
We would much prefer a performant workaround over switching UCX to transports that are not optimal for IB.

@jsquyres
Member

Understood. Sorry for the delay. ☹️

@sarats

sarats commented Mar 29, 2021

Since this affects InfiniBand specifically, and also HPC-X (which is derived from OpenMPI), could someone from Nvidia/Mellanox aid in the search for a fix/workaround?

@sarats

sarats commented Apr 6, 2021

Great to hear about the progress on this front. Thanks to @bosilca for his hard work on this.

Note: there will be an additional PR after this one for fixing #8442

If desired, we can run our full application tests with the patch.

@bosilca
Member

bosilca commented Apr 7, 2021

@sarats the patch is ready, please test and let me know the outcome.

@sarats

sarats commented Apr 7, 2021

@bosilca To clarify, do we just grab the latest open-mpi master for testing? From the comment above, I thought we had to wait for something else.

@bosilca
Member

bosilca commented Apr 7, 2021

@sarats the PR is not yet merged, so the master is not enough. You need to test #8769.

@jsquyres
Member

jsquyres commented Apr 7, 2021

One more point: once we verify that the fix that went into master (#8735) plus #8769 fixes the actual issue, then we'll generate backports for the relevant release branches.

@jayeshkrishna
Author

I was trying to test the fix (master + PR #8769; OpenMPI info for my build: https://gist.github.com/jayeshkrishna/d6158a7fa7f9b8add9d3c158ce22a9e0) and I see the following warning when I run simple MPI tests,

ucp_context.c:1529 UCX  WARN  UCP version is incompatible, required: 1.10, actual: 1.9

The ucx_info (-v) command shows,

# UCT version=1.10.0 revision a212a09
...

I also noticed that while configuring OpenMPI, I see the following in the configure output,

...
--- MCA component common:ucx (m4 configuration macro)
...
checking for UCX version compatibility... yes
checking UCX version... ok (not 1.8.x)
...
--- MCA component btl:uct (m4 configuration macro)
checking for MCA component btl:uct compile mode... static
checking check uct version... yes
checking UCT version compatibility... UCT version not compatible - need UCX 1.9 or older
checking if MCA component btl:uct can compile... no
...

Any ideas on how to debug this warning?

@hoopoepg
Contributor

hoopoepg commented Apr 8, 2021

You may build UCX from the GitHub master branch and use it in the OMPI build.
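
A rough sketch of that approach is shown below; the install prefixes and parallel-make settings are assumptions.

# Build UCX from the master branch.
git clone https://github.com/openucx/ucx.git
cd ucx && ./autogen.sh
./contrib/configure-release --prefix=$HOME/sw/ucx-master
make -j8 install && cd ..

# Build Open MPI from git against that UCX installation.
git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi && ./autogen.pl
./configure --prefix=$HOME/sw/ompi-master --with-ucx=$HOME/sw/ucx-master
make -j8 install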

@jayeshkrishna
Author

The warning that I was seeing in my builds was due to an issue on our local cluster (the sysadmins had installed different versions of UCX on the login and compute nodes; once that was fixed, the warnings went away).
I tried the MPI gather tests on the cluster and they run successfully with the fix (OpenMPI master + PR #8769). I am going to do more tests tonight and will report the results.
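
For anyone hitting a similar mismatch, a quick way to compare the UCX installed on each node with the UCX the Open MPI build actually links against might be the following (the install prefix variable is a placeholder):

# Run on each node (login and compute):
ucx_info -v                                  # UCX version installed on this node
ompi_info | grep -i ucx                      # UCX-related components in the OMPI build
ldd $OMPI_PREFIX/lib/openmpi/mca_pml_ucx.so | grep ucp   # which libucp is resolved at runtime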

@jayeshkrishna
Author

jayeshkrishna commented Apr 19, 2021

Is there a way I can build OpenMPI master with PMI2 support? Apparently the MPI gather tests were running without PMI support from Slurm (as singletons), since I used the internal OpenMPI PMIx to build the library.

============ OpenMPI configure error if trying to use PMI (PMI2) in /usr ===============
...
configure: WARNING: Open MPI no longer supports PMI-1 or PMI-2 libraries.
configure: WARNING: PMIx is now required. Either the internal version or an
configure: WARNING: external version of PMIx may be used, so long as the
configure: WARNING: external version is compatible with the PMIx v2.2
configure: WARNING: Standard or higher. Note that cross-version support
configure: WARNING: within the OpenPMIx library can be used by this OMPI
configure: WARNING: to interact with environments based on other PMIx
configure: WARNING: versions.
configure: error: Build cannot continue.
...
===================================================================
===================== srun does not seem to support PMIx =================
$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
===================================================================

@rhc54
Contributor

rhc54 commented Apr 19, 2021

Afraid not - you'll need to have Slurm configured with PMIx support.

@jsquyres
Member

Afraid not - you'll need to have Slurm configured with PMIx support.

That is the best option (the implication here is that the upcoming Open MPI v5 will not support PMI1/PMI2). Another option is to use an older version of Open MPI that still has support for PMI2.
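
For example, a 4.x build against Slurm's PMI2 might look roughly like the hedged sketch below; the PMI install path and UCX prefix are assumptions.

# Open MPI 4.0.x/4.1.x still accept --with-pmi for Slurm's PMI-1/PMI-2 libraries.
./configure --prefix=$HOME/sw/ompi-4.1.1 --with-ucx=$UCX_DIR \
            --with-slurm --with-pmi=/usr
make -j8 install

# Launch through Slurm's PMI2 plugin:
srun --mpi=pmi2 -n 2 ./mpi_gather_openmpi --nvars=13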

@jayeshkrishna
Author

ok, thanks. I am working with the sysadmins here to address the issue.

@jayeshkrishna
Author

Can you backport the changes in PR #8735 and PR #8769 to a branch off the OpenMPI 4.0.4 release? I can use that branch to test the fixes on our cluster.

  • Since Slurm is being used in a production environment, the sysadmins cannot recompile it at this time to include support for PMIx.
  • I tried disabling support for Slurm and using the OpenMPI launcher directly with OpenMPI master + PR #8769 ("Fixing the partial pack unpack issue"); no external PMI library was specified, so the OpenMPI internal PMIx must have been used by the build. However, I started getting crashes in ROMIO (again related to PMI database operations) with this build.
  • I tried manually picking (creating and applying patches) the changes in PR #8735 ("Reenable heterogeneous support") onto the source of the OpenMPI 4.0.4 release, but the sources have diverged and applying the patches failed (some of these changes were trivial, but I do not want to step on merge landmines). A rough sketch of this approach is at the end of this comment.

So if you can backport the required changes (in the two PRs) to a branch off OpenMPI 4.0.4, I can test it out on our cluster.
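
The manual backport attempt mentioned above was roughly the following; the commit IDs are placeholders, not the actual commits from the two PRs.

git clone https://github.com/open-mpi/ompi.git && cd ompi
git checkout -b test-datatype-fix v4.0.4
# Cherry-pick the commits from PR #8735 and PR #8769 (placeholder SHAs):
git cherry-pick <sha-from-8735> <sha-from-8769>
# Conflicts here are where the 4.0.4 and master sources have diverged.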

@jayeshkrishna
Author

@jsquyres / @bosilca : Would a backport of these 2 PRs (#8735 + #8769) be possible on OpenMPI 4.0.4? It would be great if I could test the changes with PMI2 on our cluster.
(OpenMPI 4.0.4 is the version currently installed on the cluster, so it would be the easiest one to test, but any version that supports PMI2 + Slurm would also work.)

@jsquyres
Member

jsquyres commented Apr 26, 2021

@jayeshkrishna As you probably saw, we released 4.1.1 with the partial datatype fix. That might be a good solution for you.

We're working on backporting that same partial datatype fix to the v4.0.x branch -- @hppritcha may have a PR up for that in the next day or so. (Update: #8859.)

We have not discussed bringing back the heterogeneous functionality to either of the 4.x branches.

@jayeshkrishna
Author

Thanks! I will try 4.1.1 and see if it works on our cluster.

@jayeshkrishna
Author

I tried OpenMPI 4.1.1 and it fixes the issues on our cluster! I ran the MPI reproducer on multiple nodes and also the E3SM test that was failing with OpenMPI 4.0.4, and both tests pass.

@sarats

sarats commented Apr 27, 2021

Great news! Thanks @jayeshkrishna for narrowing this down and putting together a small reproducer to get this started. Thanks to everyone on the OpenMPI team, especially @bosilca, for the hard work in fixing this.

@awlauria
Contributor

I tried this on master and could not reproduce it. Based on that and the previous comments, I believe we can call this issue closed.

#8859 has been merged into v4.0.x.
