ch4/ofi: Segmentation fault inside MPI_Alltoallw call #6341
Yes, please share the steps to reproduce the bug.
@hzhou Since it might have been fixed in MPICH 4.1b1, do you know which PR has the fix? Below are my steps to reproduce the segmentation fault on ANL GCE nodes (Ubuntu 20 with the default GCC 9.4.0):
1. Build MPICH 4.1a1 with the -g flag and the ch4:ofi device.
2. Using the custom MPICH build, build the PnetCDF library required by SCORPIO.
3. Build and run a C example of SCORPIO.
@skosukhin found a similar issue using v4.0.3 with yaksa (not with dataloop). It also causes a segmentation fault in the same routine. Unfortunately, I was not able to generate a simple reproducer... Do you maybe have such a reproducer that we could use in our configure check to test for a usable MPI?
@k202077 Hmm, the mentioned patch restores a shortcut for a contiguous datatype. Thus, the potential bug may still exist. Could you describe how to reproduce your original segfault? If the reproducer is not simple, we may not be able to get to it right away, but it is still useful for reference.
I have reproduced the bug following #6341 (comment) and confirmed that the same patch, b0cef14, fixes the issue. I'll dig a bit deeper and update.
The bug was in https://github.com/hzhou/mpich/blob/84902a51df5330d979668900e9f3ab7b567cfe1b/src/mpi/datatype/typerep/src/typerep_yaksa_pack.c#L147 due to a typo. The typo was introduced in a338d6a (which fixed a separate, earlier bug) and was corrected in b0cef14. Thus the affected MPICH releases are 4.0 through 4.1a1.
I am closing this issue since it has been fully investigated. If you need assistance with patching or a configure check, please feel free to reopen and comment.
@k202077 and I came up with a reproducer which we will use to detect defective versions. I'm sharing it here so others can quickly detect if their MPI installation is affected:
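The reproducer originally attached to this comment is not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of the kind of test involved: it drives MPI_Alltoallw with non-contiguous (vector) derived datatypes, the path that exercised the yaksa pack code under ch4:ofi. Everything in it (datatype shape, counts, iteration count) is illustrative, not the original test.

```c
/* Hedged sketch, not the original reproducer: call MPI_Alltoallw with
 * non-contiguous (vector) datatypes a few times, the pattern that hit
 * the ch4:ofi / yaksa packing segfault. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A strided block of doubles forces the non-contiguous pack path. */
    const int nblk = 4, blklen = 2, stride = 3;
    MPI_Datatype vec;
    MPI_Type_vector(nblk, blklen, stride, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    int elems_per_peer = nblk * stride;   /* doubles reserved per peer */
    double *sendbuf = malloc((size_t)size * elems_per_peer * sizeof(double));
    double *recvbuf = malloc((size_t)size * elems_per_peer * sizeof(double));
    for (int i = 0; i < size * elems_per_peer; i++)
        sendbuf[i] = rank + 0.001 * i;

    /* MPI_Alltoallw takes per-peer counts, byte displacements, and datatypes. */
    int *counts  = malloc(size * sizeof(int));
    int *sdispls = malloc(size * sizeof(int));
    int *rdispls = malloc(size * sizeof(int));
    MPI_Datatype *types = malloc(size * sizeof(MPI_Datatype));
    for (int i = 0; i < size; i++) {
        counts[i] = 1;
        sdispls[i] = rdispls[i] = i * elems_per_peer * (int)sizeof(double);
        types[i] = vec;
    }

    /* Repeat a few times; in the original report only a later call failed. */
    for (int iter = 0; iter < 4; iter++)
        MPI_Alltoallw(sendbuf, counts, sdispls, types,
                      recvbuf, counts, rdispls, types, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Alltoallw completed without a segfault\n");

    MPI_Type_free(&vec);
    free(sendbuf); free(recvbuf);
    free(counts); free(sdispls); free(rdispls); free(types);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and run on a few ranks, this may segfault on an affected ch4:ofi build and should complete cleanly on a patched one; since it is not the original test, a clean run is not a guarantee.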
We have a test case that calls MPI_Alltoallw a few times, and the last MPI_Alltoallw call fails with a segmentation fault (signal 11).
MPICH versions: 4.0, 4.0.1, 4.0.2, 4.0.3, 4.1a1 (not reproducible with 4.1b1; a quick version check is sketched after this list)
MPICH device: ch4:ofi (not reproducible with ch3:nemesis)
Environment: ANL GCE Linux workstations (Ubuntu 20, GCC 9.4), ANL JLSE nodes (Intel oneAPI, aurora mpich module based on MPICH 4.1a1)
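To compare an installation against the affected releases above, a small sketch like the following can report the library version; MPI_Get_library_version is standard MPI-3 and MPICH_VERSION is an MPICH-specific macro from mpi.h. This snippet is my illustration, not the configure check discussed in the comments.

```c
/* Print the MPI library version so it can be compared against the
 * affected MPICH releases listed above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    printf("MPI library: %s\n", version);
#ifdef MPICH_VERSION
    printf("MPICH version macro: %s\n", MPICH_VERSION);
#endif
    MPI_Finalize();
    return 0;
}
```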
Call stack generated with Address Sanitizer (MPICH 4.1a1):
@hzhou Is this a known issue to you? It seems it might have been fixed in the latest MPICH 4.1b1.
If necessary, I can provide detailed steps for you to run this test case on ANL GCE workstations.