mpirun may hang indefinitely if a process exits with a non-zero exit code #3380
Comments
I'm not sure that we define what happens if you put MPI_Finalize() in an atexit() call. What happens if you remove the call to atexit()? |
It is possible to write equivalent code without atexit(): just call
MPI_Finalize() right before return. As soon as I get to the machine I'll
test it.
On Apr 19, 2017 at 21:30, "Jeff Squyres" <notifications@github.com>
wrote:
… I'm not sure that we define what happens if you put MPI_Finalize() in an
atexit() call.
What happens if you remove the call to atexit()?
|
FYI: In this context problem may have something to do with similar effects as |
It is not atexit(). The same test with the atexit() call commented out:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        /* Without an argument there is nothing to test. */
        if (argc < 2) {
            return 0;
        }
        MPI_Init(&argc, &argv);
        //atexit(cleanup);
        int i = atoi(argv[1]);  /* requested exit code */
        MPI_Finalize();
        return i;
    }

I do believe it is related to the mechanism that kills the other processes when MPI detects that one process exited with a non-zero exit status. This mechanism seems unable to handle a process which properly calls MPI_Finalize() |
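For readers unfamiliar with the abort-on-non-zero behavior discussed here, the following is a toy sketch of the general idea only: a launcher reaps its children and, on the first non-zero exit status, signals the ones still running. It is not Open MPI's implementation. `run_toy_job` and `TOY_MAX_PROCS` are hypothetical names, and the children are stand-ins that do no MPI work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TOY_MAX_PROCS 64  /* hypothetical limit for this sketch */

/* Hypothetical toy launcher, NOT Open MPI code.
 * Forks nprocs children. Child 0 exits at once with rank0_status;
 * every other child sleeps for one second and then exits with 0.
 * On the first non-zero exit status the launcher sends SIGTERM to all
 * children that are still running, then keeps reaping until none remain.
 * Returns the first non-zero exit status seen (0 if there was none) and
 * stores in *killed how many children died from a signal. */
int run_toy_job(int nprocs, int rank0_status, int *killed)
{
    pid_t pids[TOY_MAX_PROCS];
    int alive[TOY_MAX_PROCS];
    int first_bad = 0;

    assert(nprocs > 0 && nprocs <= TOY_MAX_PROCS);
    *killed = 0;

    for (int i = 0; i < nprocs; i++) {
        pids[i] = fork();
        assert(pids[i] >= 0);
        if (pids[i] == 0) {
            if (i == 0)
                _exit(rank0_status);  /* the "failing" rank */
            sleep(1);                 /* stand-in for real work */
            _exit(0);
        }
        alive[i] = 1;
    }

    for (int remaining = nprocs; remaining > 0; remaining--) {
        int status;
        pid_t done = waitpid(-1, &status, 0);
        assert(done > 0);

        /* Mark the reaped child so it is never signaled afterwards. */
        for (int i = 0; i < nprocs; i++)
            if (pids[i] == done)
                alive[i] = 0;

        if (WIFSIGNALED(status)) {
            (*killed)++;
        } else if (WIFEXITED(status) && WEXITSTATUS(status) != 0
                   && first_bad == 0) {
            first_bad = WEXITSTATUS(status);
            /* Abort the job: terminate everything still running. */
            for (int i = 0; i < nprocs; i++)
                if (alive[i])
                    kill(pids[i], SIGTERM);
        }
    }
    return first_bad;
}
```

Called as `run_toy_job(4, 1, &killed)`, the sketch mirrors `mpirun -n 4 ./test 1`: one child reports a non-zero status and the other three are terminated. The hang reported in this issue would correspond to the final reaping loop never completing, which this toy version does not attempt to reproduce.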
I verified this is working correctly on master and v3.x (with #3462):
I'd advise updating once v3.0.0 is released as the v1.10 series has been "closed". |
Hi everyone, can someone help me with this problem? I'm trying to run the turbPipe example in Nek5000 v19 on my personal computer, but I get this error every time I use Nekbmpi. The funny thing is that it works for 2D models. I did a small test to see if this error pops up with a different 3D example (such as TurbJet), and the same thing happens. I am not sure whether that's because my PC cannot handle 3D simulations or whether there is another issue. The error is: Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[46568,1],0] |
@Platinumd1991 please do not comment on a closed 2-year old bug report asking about a different problem than was originally cited in the issue. For new questions, please open a new issue (or post to the Open MPI users mailing list; see https://www.open-mpi.org/community/lists/ompi.php). Thanks. |
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
(Open MPI) 1.10.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From Ubuntu 16.10 repository.
Please describe the system on which you are running
Details of the problem
MPI may fail to quit when one process exits with non-zero exit code. Consider the sample code:
If I run with
    mpirun -n 4 ./test 0
the job quits cleanly, as expected. But if I run with
    mpirun -n 4 ./test 1
there is a significant chance the job will hang indefinitely, after displaying the following message:
By running
    mpirun -n 4 ./test 1
ten times, mpirun hung once while killing the processes. (By the way, where is the setting that makes Open MPI abort the job immediately if one process returns a nonzero exit code?)
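An intermittent hang like this is easier to quantify with a small harness that repeats the command under a timeout and counts the runs that had to be killed. The following is a minimal sketch under stated assumptions: POSIX only, and `run_with_timeout`, `count_hangs`, and `RUN_TIMED_OUT` are hypothetical names, not part of Open MPI. It kills only the launched process itself, so any ranks left behind by a hung mpirun may need separate cleanup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define RUN_TIMED_OUT (-1)  /* hypothetical sentinel: command was killed */
#define RUN_FAILED    (-2)  /* hypothetical sentinel: fork/wait problem   */

/* Seconds on a monotonic clock, unaffected by wall-clock changes. */
static double now_sec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (double)t.tv_sec + (double)t.tv_nsec / 1e9;
}

/* Runs the NULL-terminated command argv and waits at most timeout_sec
 * seconds for it. Returns its exit status (0..255), RUN_TIMED_OUT if it
 * was still running at the deadline and had to be killed, or RUN_FAILED
 * if it could not be started or did not exit normally. */
int run_with_timeout(char *const argv[], int timeout_sec)
{
    assert(argv != NULL && argv[0] != NULL && timeout_sec > 0);

    pid_t pid = fork();
    if (pid < 0)
        return RUN_FAILED;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* exec failed */
    }

    const struct timespec tick = { 0, 50 * 1000 * 1000 };  /* 50 ms */
    const double deadline = now_sec() + timeout_sec;

    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : RUN_FAILED;
        if (r < 0)
            return RUN_FAILED;
        if (now_sec() >= deadline) {
            kill(pid, SIGKILL);        /* give up on this run */
            waitpid(pid, &status, 0);  /* reap it to avoid a zombie */
            return RUN_TIMED_OUT;
        }
        nanosleep(&tick, NULL);        /* poll again shortly */
    }
}

/* Repeats the command `runs` times and counts how many runs timed out. */
int count_hangs(char *const argv[], int runs, int timeout_sec)
{
    int hangs = 0;
    for (int i = 0; i < runs; i++)
        if (run_with_timeout(argv, timeout_sec) == RUN_TIMED_OUT)
            hangs++;
    return hangs;
}
```

For the case above, the command would be `{"mpirun", "-n", "4", "./test", "1", NULL}` passed to `count_hangs(cmd, 10, 30)`. The 30-second timeout is an arbitrary choice for illustration, since a healthy run of this test should finish almost immediately.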