Data corruption with OpenMPI 4.0.4 and UCX 1.9.0 #8442
Comments
@jayeshkrishna can you pls try setting UCX_MEM_EVENTS=n to check if the issue is related to memory hooks? |
I just tried setting "UCX_MEM_EVENTS=n" (env variable) but that did not help (the reproducer validation failed) |
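(For reference, a minimal sketch of how such a UCX setting is typically exported to all ranks with OpenMPI's mpirun; the binary name and rank count below are placeholders, not taken from the original report.)

```sh
# Pass the UCX environment variable through to every rank via mpirun's -x option
# (binary name and rank count are placeholders)
mpirun -np 4 -x UCX_MEM_EVENTS=n ./mpi_gather --nvars=13
```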
@open-mpi/ucx FYI |
thank you for the bug report. I tried to reproduce your issue and it seems our compiler is too old to process the regexp - I see this error:
we are using GCC 4.8.5 20150623 (Red Hat 4.8.5-36). What compiler do you use? |
@hoopoepg Jayesh has used Intel-20.0.4 (from the gist above). |
@hoopoepg Apparently, gcc 4.9 added "Improved support for C++11, including: support for <regex>;" |
ok, thank you |
Yes, a newer version of gcc should work fine. We use gcc 8.2.0 for our gnu tests (We require gcc 4.9+ for building Scorpio + E3SM due to limited C++ regex support in gcc < 4.9). |
hi @jayeshkrishna thank you |
Ok, I will try this as soon as our cluster finishes the scheduled maintenance today. Meanwhile, were you able to recreate the issue with the test program? |
Added blocker as this is data corruption issue. |
By setting the env variable UCX_TLS=rc I can recreate the issue on a single node. On a single-node run with UCX_TLS=rc and UCX_MAX_RNDV_LANES=1 the issue still exists. The issue also exists with a multi-node run with UCX_MAX_RNDV_LANES=1 (UCX_TLS not set). |
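(A minimal sketch of the environment settings described in the comment above; the binary name and rank count are placeholders.)

```sh
# Restrict UCX to the RC transport and limit rendezvous lanes, then run the
# reproducer on a single node (binary name and rank count are placeholders)
export UCX_TLS=rc
export UCX_MAX_RNDV_LANES=1
mpirun -np 36 ./mpi_gather --nvars=13
```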
Also note that argparser.cpp has a no_regex_parse() function to support compilers that do not support regex. If you are having issues with regex when running your code, you might want to try replacing
in mpi_gather.cpp:57 with
(I would, however, recommend a newer version of the gnu compiler.) |
@jayeshkrishna is it possible to reproduce issue using OMPI master branch? |
I haven't tried the OpenMPI master but as mentioned in the issue description above the issue also occurs with OpenMPI 4.1.0 (OpenMPI 4.1.0 + UCX 1.9.0). Are you able to recreate the issue using the reproducer? |
yes, we reproduced it, also reproduced using the shared memory transport... |
@hoopoepg so you don't see this problem on master? |
@hppritcha right, at least on the configuration described by @jayeshkrishna and on our test configurations used for reproducing. |
@hppritcha update: the issue is reproduced on an extended configuration with the master branch (completed a few minutes ago). Will continue to investigate tomorrow. |
When will the fix for this issue be integrated into master/release (and are there any workarounds that we can use until the fix is available)? |
@bosilca mentioned on the call today (9 Mar 2021) that he's still looking into this. |
Do you have any estimate of when this bug will be fixed in OpenMPI? We have been unable to switch to OpenMPI due to this bug (the "workaround" of restricting UCX_TLS to "sm,ud" avoids the issue for now but significantly slows down the simulations). |
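(The workaround referred to above amounts to restricting the UCX transport list; a hedged sketch, where the launch line itself is a placeholder.)

```sh
# Workaround: avoid the rc/dc transports at the cost of performance
export UCX_TLS=sm,ud
mpirun -np 36 ./mpi_gather --nvars=13
```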
There are two parts to this issue:
If you're not using IB or RoCE, you might be able to try the nightly 4.1.x snapshot tarball and see if that fixes your issue (meaning: the datatype issue may not occur if not using the UCX PML). The datatype issue is still actively being worked on. Believe me, we want this fixed as soon as possible... |
Unfortunately, we are using HDR200 IB, so we will wait on a fix. |
Understood. Sorry for the delay. |
Since this affects InfiniBand specifically and also HPC-X (derived from OpenMPI), could someone from Nvidia/Mellanox aid in the search for a fix/workaround? |
@sarats the patch is ready, please test and let me know the outcome. |
@bosilca To clarify, we just get the latest open-mpi:master for testing? From the above comment, I thought we had to wait for something else. |
I was trying to test the fix (master + PR #8769; OpenMPI info for my build: https://gist.github.com/jayeshkrishna/d6158a7fa7f9b8add9d3c158ce22a9e0) and see the following warnings when I run simple MPI tests,
The ucx_info (-v) command shows,
I also noticed that while configuring OpenMPI, I see the following in the configure output,
Any ideas on how to debug this warning? |
You may build UCX from the GitHub master branch and use it in the ompi build |
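(A rough sketch of that approach, assuming a standard autotools build of UCX; the install prefix is a placeholder.)

```sh
# Build UCX from its master branch (install prefix is a placeholder)
git clone https://github.com/openucx/ucx.git && cd ucx
./autogen.sh
./configure --prefix=$HOME/ucx-master && make -j && make install
# Then point the Open MPI configure at this install:
#   ./configure --with-ucx=$HOME/ucx-master ...
```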
The warnings that I was seeing in my builds were due to an issue on our local cluster (the sysadmins had installed different versions of UCX on the login and compute nodes; once that was fixed, the warnings went away). |
Is there a way I can build OpenMPI master with PMI2 support? Apparently the MPI gather tests were running without PMI support (singleton) under Slurm, since I used the internal OpenMPI PMI (PMIx) to build the library. |
Afraid not - you'll need to have Slurm configured with PMIx support. |
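(A hedged sketch of how to check what the local Slurm installation supports; assumes srun is available, and the launch line is a placeholder.)

```sh
# List the PMI flavors this Slurm build knows about
srun --mpi=list
# If pmix is listed, jobs can be launched with it explicitly
srun --mpi=pmix -n 4 ./mpi_gather --nvars=13
```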
That is the best option (the implication here is that the upcoming Open MPI v5 will not support PMI1/PMI2). Another option is to use an older version of Open MPI that still has support for PMI2. |
ok, thanks. I am working with the sysadmins here to address the issue. |
Can you backport the changes in PR #8735 and this PR (PR #8769) to a branch off OpenMPI 4.0.4 release? I can use that branch to test the fixes on our cluster.
So if you can backport the required changes (in the two PRs) to a branch off OpenMPI 4.0.4, I can test it out on our cluster. |
@jsquyres / @bosilca : Would a backport of these 2 PRs (#8735 + #8769) be possible on OpenMPI 4.0.4? It would be great if I can test the changes with PMI2 on our cluster. |
@jayeshkrishna As you probably saw, we released 4.1.1 with the partial datatype fix. That might be a good solution for you. We're working on back-porting that same partial datatype fix back to the v4.0.x branch -- @hppritcha may have a PR up for that in the next day or so. (update: #8859) We have not discussed bringing back the heterogeneous functionality to either of the 4.x branches. |
Thanks! I will try 4.1.1 and see if it works on our cluster. |
I tried OpenMPI 4.1.1 and it fixes the issues on our cluster! I tried the MPI reproducer on multiple nodes and also tried the E3SM test that was failing with OpenMPI 4.0.4 and both tests pass. |
Great news! Thanks @jayeshkrishna for narrowing down and putting together a small reproducer to get this started. Thanks to everyone in the OpenMPI team, esp. @bosilca for the hard work in fixing this. |
I tried this on master and could not reproduce. Based on that and the previous comments, I believe we can call this issue closed. #8859 has been merged to v4.0.x. |
There seems to be a data corruption issue with OpenMPI 4.0.4 + UCX 1.9.0 (and 4.1.0 + UCX 1.9.0) when communicating data between processes using MPI. The attached test code is an MPI program in C++ that shows the issue when run on more than one node in a local cluster. Restricting the UCX transports to sm, tcp, ud is a workaround for the issue (not setting UCX_TLS, or setting it to rc, dc, results in data corruption). The issue does not show up with other MPI libraries (Intel MPI) installed on the system. This issue may be related to #8321.
More information on the issue is provided below,
The OpenMPI and UCX install information and the test program are available at https://gist.github.com/jayeshkrishna/5d053d3d5bba11359ea2dc82c435c3ea. On a successful run the test program prints "SUCCESS: Validation successful". When data validation fails it prints "ERROR: Validation failed" along with the first index in the gather buffer where the validation failed.
Building the test program
Note that argparser.[cpp/h] are helper code and you would just need to look at mpi_gather.cpp (the main() and create_snd_rcv_dt() routines).
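(The exact build commands are in the linked gist; as a rough sketch, something along these lines should work. The compiler wrapper and flags are assumptions; C++11 is needed for <regex>.)

```sh
# Build the reproducer with the MPI C++ compiler wrapper (flags are placeholders)
mpicxx -std=c++11 -O2 -o mpi_gather mpi_gather.cpp argparser.cpp
```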
Running the test program
The issue shows up with certain values of "--nvars" (e.g., --nvars=13) on our local cluster with processes across multiple nodes, so I use a loop to test it out,
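(The original loop is not shown above; a hedged reconstruction of that kind of sweep might look like the following, where the rank count and --nvars range are assumptions.)

```sh
# Sweep over --nvars values, running the reproducer across multiple nodes
# (rank count and range are placeholders)
for nv in $(seq 1 16); do
  echo "--nvars=$nv"
  mpirun -np 128 ./mpi_gather --nvars=$nv
done
```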
Please let me know if you need further information on this issue.