romio: corruption with a simple test using ADIOS2 BP5 engine #6182
Comments
@roblatham00 This is the simplest reproducer I have so far, which still requires linking to ADIOS2 lib. I will try to replace the ADIOS2 code in the test program with all-MPI code if I can.
Valgrind output on ANL GCE nodes (Ubuntu 20, GCC 9.4.0)
@dqwu Could you try reproducing this with 4.0.x romio + 3.4.x mpich or vice versa? I think you can do this by manually swapping the Is the creating of |
Thanks for the suggestion, and I will test it later. Yes, the empty file created with fh1 is required to reproduce this issue. |
@hzhou
mplconfig.h can be found in the src directory, though.
|
Let's try some hacking --
This workaround works, but there are some linking errors:
More hacking:
|
Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible. I will try MPICH 3.4.3 with ROMIO of MPICH 4.0.2 later. |
I guess that points the fault to the changes in ROMIO |
I tried to build MPICH 3.4.3 with the ROMIO from MPICH 4.0.2. The build errors are:
Is there a hack for this build issue?
Yeah, paste in
Thanks, this works to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2. Unfortunately, MPICH 3.4.3 with ROMIO of MPICH 4.0.2 cannot reproduce this issue. In summary (tested on my laptop with Ubuntu 18 and GCC 7.4.0): |
@hzhou @roblatham00
MPICH 4.0b1 + default ROMIO: not reproducible
This points to the changes in ROMIO between 4.0b1 and 4.0rc1 (16 files differ).
It seems that this issue is caused by PR #5660. With MPICH 4.0.2, if I revert the ADIOI_COLL_TAG macros (a constant tag) in ad_write_coll.c back to myrank + i + 100 * iter or myrank + p + 100 * iter, the issue is no longer reproducible.
@hzhou @roblatham00
In the 2nd comm.Irecv call, the argument "rank - 1" (the source parameter of MPI_Irecv) should be "comm.Size() - 1". Because of this bug (possibly a typo), rank 0 never receives the integer sendToken sent by rank 3, whose value is set to 3 on rank 3. That pending data is received later, inside the ROMIO write function. This explains why, on rank 3, the first int value written is always 3: it is the earlier sendToken from rank 3 that was lost inside the ADIOS2 BP5 engine code.
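To illustrate the kind of off-by-wraparound error described here, the following is a minimal sketch of computing ring neighbors so that rank 0 wraps around to the last rank. The helper names ring_prev and ring_next are hypothetical and do not come from ADIOS2 or MPICH; the sketch only shows the index arithmetic, not the actual BP5 engine code.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only (not ADIOS2 or MPICH API). */

/* Rank of the previous process in a ring of `size` ranks.
 * A plain `rank - 1` yields -1 on rank 0; adding `size` before the
 * modulo makes rank 0 wrap around to `size - 1`. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank - 1 + size) % size;
}

/* Rank of the next process in the ring; the last rank wraps to 0. */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}
```

In an MPI program, a value such as ring_prev(rank, size) could be passed as the source argument of MPI_Irecv, so that with 4 processes rank 0 posts its receive from rank 3 instead of from -1.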
@hzhou @roblatham00 I think this issue can be closed later, but it would be helpful if newer versions of MPICH could detect an invalid source argument (e.g. -1) in MPI_Irecv and perhaps warn the user.
@dqwu Good job locating the bug!
Unfortunately |
Thanks for the information. As you can see, this bug is not easy to reproduce with MPICH 4.x. Actually, if we delete dupcomm1, dupcomm2, or dupcomm3 (any of them) in the latest test program, this bug cannot be reproduced (the ADIOS2 BP5 engine code creates a lot of MPI comms). |
@hzhou @roblatham00 |
@dqwu Thanks for the all-MPI reproducer. It did reveal an issue in mpich. What happens is -- there is a pending send message in I am not sure how to fix this yet, but ideally traffic from different communicators should never interfere. |
Closing this issue. The lingering send message problem is tracked in the new issue #6184.
This issue is only reproducible with MPICH 4.x (not reproducible with MPICH 3.x or OpenMPI).
Below are detailed steps to reproduce it.
[Build and install MPICH 4.0.2]
[Add installed MPICH to $PATH]
export PATH=/path/to/mpich/installation/bin:$PATH
[Build and install ADIOS2 2.8.3]
[Build and run the test program]
[Example output]
Content of the test program (test_adios_mpiio.c)