Occasional segfaults and hangs when running multiple MPI jobs on one node #7642
Comments
Another hang! This time two MPI jobs crashed.
Second crashed job's backtrace (printed by 1/2 ranks):
I don't know how to translate these backtraces to code lines, since I do not know the addresses of the shared objects that the crashed processes used. Any hints on how to do this? Again, there is an additional MPI job that hangs forever. It has two ranks that are waiting for progress at different locations, i.e., they are out of sync and hence deadlocked.
Rank 1's backtrace of the hung job, taken at a random time:
This job is at a really early stage where the initial communicators are set up shortly after startup. By the way, additional independent MPI jobs still run after these jobs crashed or hung.
Can you provide simple reproducers for either or both of these problems? Per your question above: there isn't a good way to map the addresses to individual line numbers. If you are compiling Open MPI yourself, you can add in
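One thing that sometimes helps: if the raw frames contain a shared-object path plus a module-relative offset (glibc usually prints something like libopen-pal.so.40(+0x4d7a3)), addr2line can often resolve them, provided the library was built with debug symbols. The path and offset below are just placeholders:

    # resolve one frame; library path and offset are examples only
    addr2line -f -C -e /opt/openmpi/lib/libopen-pal.so.40 0x4d7a3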
Unfortunately, I cannot provide a reproducer, since the problem only occurs every few days while everything works fine in the meantime. IMHO, the vader BTL probably runs into a race condition when multiple MPI jobs are running simultaneously. However, I also tried to explicitly stress the system with many simultaneous jobs to trigger the error, but this was not successful. Now I will try to see if I can reproduce it with a trivial example.
Here are some new backtraces from another crash. This time the application was instrumented with gcc's AddressSanitizer (Open MPI itself was not sanitized). The sanitizer printed a resolved backtrace of the crashing jobs. However, this time no jobs hung. I had to demultiplex the output from the MPI processes, and since ASAN unfortunately does not print the process ID in every line, the backtraces might be demultiplexed incorrectly. However, I think they are correct, and it should not make a big difference anyway.
First concurrent MPI job:
Second concurrent MPI job:
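For reference, the AddressSanitizer instrumentation mentioned above amounts to a build roughly like this (source and binary names are placeholders):

    # build only the application with ASAN; Open MPI stays uninstrumented
    mpicc -g -O1 -fsanitize=address -fno-omit-frame-pointer -o app app.c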
Still an issue with 4.0.4! Disabling vader and using the tcp BTL fixes it, but this is not a nice solution.
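For reference, restricting the BTL selection can be done roughly like this (application name and rank count are placeholders):

    # exclude vader: on-node traffic goes over tcp instead
    mpirun --mca btl tcp,self -np 2 ./app
    # explicitly enabling vader again, for comparison
    mpirun --mca btl vader,self -np 2 ./app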
I am not sure if it is really related, but I observed occasional errors in a minimal example when using the vader BTL. Here is how to reproduce it:

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        MPI_Finalize();
    }

Compile and run many instances in a loop:
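A loop roughly like the following (binary name, rank count, and iteration count are just examples):

    # build the reproducer
    mpicc -o init_finalize init_finalize.c
    # start many short-lived two-rank jobs back to back
    for i in $(seq 1 1000); do
        mpirun -np 2 ./init_finalize || echo "run $i failed"
    done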
This results in occasional errors:
When using only the tcp and self BTLs, I observed no such errors so far:
It seems like the error triggered by the minimal reproducer is fixed in 4.1.0. However, it is still an issue for 4.0.5.
4.0.6rc1 also seems to fix the error for the minimal reproducer. Possibly due to the PMIx update? I will try to install 4.1.0 on our CI build system to see if the errors in production are also fixed. Since they used to happen rather infrequently, I will have to wait some weeks to be able to assess whether the issue is solved.
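To double-check which installation the CI jobs actually pick up, something like this should be enough (assuming the new build is first in PATH):

    # both should report the expected 4.1.0 installation
    mpirun --version
    ompi_info | head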
@jeorsch Is this issue resolved for you?
@jsquyres, up to now it has definitely been much more robust. I just noticed one strange hang in the last two months, but it seems to be related to something else. If I have any new findings, I will reopen this issue or open a new one.
Thank you!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.3 (also occurred with v4.0.0)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a tarball with "--prefix", "--enable-debug", and "--with-hwloc=internal" configure parameters.
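In other words, roughly the following, where the installation prefix is just an example:

    ./configure --prefix=/opt/openmpi-4.0.3 --enable-debug --with-hwloc=internal
    make -j
    make install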
Please describe the system on which you are running
Details of the problem
We run a test suite that involves many parallel MPI jobs on a single node. The jobs are started with

    mpirun --bind-to none -use-hwthread-cpus -oversubscribe

since this gives us the best throughput. However, occasionally a single MPI job exits with a segmentation fault and another job hangs at the same time. I don't think it is related to my application, since the erroneous behaviour happens only every few days or weeks, while most runs pass flawlessly. Furthermore, the test suite jobs that fail/hang are different each time. Open MPI 1.10.2, which we used before, never had that problem. Here is a corresponding backtrace for a job with two ranks that crashed. I had to strip confidential content (hostname and application backtrace).
First process:
Second process:
Additionally, an independent MPI job hangs forever. Here are the backtraces from all threads related to such a hanging process:
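For reference, such per-thread backtraces can be captured by attaching gdb to the hanging rank (the PID is a placeholder):

    # dump the stack of every thread in the hung process
    gdb -p 12345 -batch -ex "thread apply all bt"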
I can try to provide some more information if requested, but I will have to wait until such a hang happens again. As I said, it usually takes some days until a CI pipeline gets stuck because of this.