mpirun may hang indefinitely if a process exits with non-zero exit code #3380

Closed
lvella opened this issue Apr 19, 2017 · 7 comments
@lvella

lvella commented Apr 19, 2017

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

(Open MPI) 1.10.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From Ubuntu 16.10 repository.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 16.10
  • Computer hardware: Intel i5
  • Network type: No network, local test

Details of the problem

MPI may fail to quit when one process exits with non-zero exit code. Consider the sample code:

#include <mpi.h>
#include <stdlib.h>

void cleanup()
{
	MPI_Finalize();
}

int main(int argc, char *argv[])
{
	if(argc < 2) {
		return 0;
	}
	MPI_Init(&argc, &argv);
	atexit(cleanup);

	int i = atoi(argv[1]);
	return i;
}

If I run mpirun -n 4 ./test 0, the job quits cleanly, as expected. But if I run mpirun -n 4 ./test 1, there is a significant chance the job will hang indefinitely after displaying the following message:

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

When I ran mpirun -n 4 ./test 1 ten times, mpirun hung once while killing the processes. (By the way, where is the setting that makes Open MPI abort the job immediately if one process returns a non-zero exit code?)

@jsquyres
Member

I'm not sure that we define what happens if you put MPI_Finalize() in an atexit() call.

What happens if you remove the call to atexit()?

@lvella
Author

lvella commented Apr 20, 2017 via email

@artpol84
Contributor

FYI:
Some time ago we observed and fixed a problem that sounds similar. It was for master/v2.x: #2675.
As I recall, the codebase in v1.10 had diverged significantly and it was impossible to back-port the patch there. I also assumed that this problem had been introduced later.

In this context, the problem may be related to similar effects, since atexit is involved.

@lvella
Author

lvella commented May 3, 2017

It is not atexit()'s fault (nor did I see any reason for it to be, because it simply ensures MPI_Finalize() is called on every process). The following modified code is equivalent to the original, does not use atexit(), and exhibits the same symptoms:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	if(argc < 2) {
		return 0;
	}
	MPI_Init(&argc, &argv);
	//atexit(cleanup);

	int i = atoi(argv[1]);

	MPI_Finalize();
	return i;
}

I do believe it is related to the mechanism that kills the other processes when MPI detects that one process exited with a non-zero exit status. This mechanism seems unable to handle a process that properly calls MPI_Finalize() instead of exiting abruptly.
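
As a side note on why the two test programs should behave the same: in C, returning from main() has the same effect as calling exit() with that value, and exit() runs the handlers registered with atexit() before the process terminates with the given status. The sketch below illustrates only that rule and does not use MPI at all. fake_finalize() is a hypothetical stand-in for MPI_Finalize(), and run_child() and struct child_result are names invented for this illustration. It assumes a POSIX system (fork, pipe, waitpid).

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe used by the child to report that the handler ran. */
static int notify_fd = -1;

/* Hypothetical stand-in for MPI_Finalize(): it only signals the parent. */
static void fake_finalize(void)
{
    if (notify_fd >= 0 && write(notify_fd, "F", 1) != 1) {
        /* Nothing useful can be done here; the parent will see no byte. */
    }
}

/* What the parent observed about one child process. */
struct child_result {
    int exit_code;   /* exit status seen by the parent, or -1 on error */
    int handler_ran; /* 1 if fake_finalize() ran before the child exited */
};

/* Fork a child that registers fake_finalize() with atexit() and then exits
   with `code`, as `return code;` from main() would. */
struct child_result run_child(int code)
{
    struct child_result res = { -1, 0 };
    int fds[2];

    if (pipe(fds) != 0) {
        return res;
    }
    fflush(NULL); /* avoid duplicated buffered output in the child */

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return res;
    }
    if (pid == 0) {
        /* Child: register the handler, then terminate with the status. */
        close(fds[0]);
        notify_fd = fds[1];
        atexit(fake_finalize);
        exit(code);
    }

    /* Parent: a single 'F' byte means the handler ran in the child. */
    close(fds[1]);
    char byte = 0;
    res.handler_ran = (read(fds[0], &byte, 1) == 1 && byte == 'F');
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        res.exit_code = WEXITSTATUS(status);
    }
    return res;
}
```

Calling run_child(1) from a small main() should report exit_code 1 with handler_ran set, and run_child(0) should report exit_code 0 with handler_ran set. In other words, the handler runs regardless of the status, and the status reaches the parent unchanged. This says nothing about how mpirun reacts to that status.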

@rhc54
Contributor

rhc54 commented May 29, 2017

I verified this is working correctly on master and v3.x (with #3462):

$ mpirun -npernode 2 ./nonzero 2
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[64873,1],2]
  Exit code:    2
--------------------------------------------------------------------------

I'd advise updating once v3.0.0 is released as the v1.10 series has been "closed".

@Platinumd1991

Hi everyone, can someone help me with this problem? I'm trying to run the turbPipe example in Nek5000 v19 on my personal computer, but I get this error every time I use Nekbmpi. The funny thing is that it works for 2D models. I did a small test to see whether this error pops up with a different 3D example (such as TurbJet), and the same thing happens. I am not sure whether that's because my PC can't handle 3D simulations or whether there is another issue.

The error is:

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[46568,1],0]
Exit code:

@jsquyres
Member

@Platinumd1991 please do not comment on a closed 2-year old bug report asking about a different problem than was originally cited in the issue. For new questions, please open a new issue (or post to the Open MPI users mailing list; see https://www.open-mpi.org/community/lists/ompi.php). Thanks.
