python train.py gives a CalledProcessError #5
Comments
Have you tried running ESTool with a simple experiment to check that MPI is
installed OK?
Also, I think it is configured for a 64-core machine. If you are using fewer
cores, pass in a flag to specify the number (instructions are in ESTool or the blog posts).
On Tue, Jul 17, 2018 at 2:01 PM Sankalp Sanand wrote:
When I run python train.py on the specified CPU system I get a very long
error message ending with,
Traceback (most recent call last):
  File "train.py", line 450, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "train.py", line 424, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134
I searched for the exit status for mpirun but wasn't able to debug the
issue.
|
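On the exit status itself: by the usual shell convention, a status above 128 means the process was terminated by a signal whose number is the status minus 128. Under that convention 134 points at signal 6, SIGABRT, which suggests that at least one of the launched processes aborted rather than that mpirun itself was misused. A minimal sketch of the decoding, assuming a POSIX system; the helper name describe_exit_status is made up for illustration and is not part of train.py:

```python
import signal


def describe_exit_status(status):
    """Return the signal name implied by a shell-style exit status, or None.

    Assumes the common "128 + signal number" convention; a launcher is free
    to report failures differently, so treat the result as a hint.
    """
    if status <= 128:
        return None  # ordinary exit code, no signal implied
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # not a signal number known on this platform


if __name__ == "__main__":
    # 134 is the status reported in the traceback above.
    print(describe_exit_status(134))
```

An abort like this is often triggered by a failed assertion or a crash in a native library inside one of the ranks, so the lines printed just before the traceback are usually more informative than the status code.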
Yes, I've tried running ESTool with a simple experiment from your estool repo using |
This could be related: ensure you have only one MPI library installed on your machine (e.g. try running this if you're on Linux). If you have multiple MPI installations, then comm.Get_size() returns 1, so the following assert statement fails |
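One rough way to look for more than one MPI installation is to list every mpirun or mpiexec executable reachable through PATH and see how many distinct binaries they resolve to. The sketch below uses only the standard library; the function name find_launchers is hypothetical, and it is a heuristic, since a second MPI installed outside PATH (for example inside a conda environment that is not active) would not show up.

```python
import os


def find_launchers(names=("mpirun", "mpiexec"), path=None):
    """Return the distinct real paths of MPI launcher executables on PATH."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                # mpirun and mpiexec are often symlinks to one binary,
                # so compare resolved paths rather than names.
                real = os.path.realpath(candidate)
                if real not in found:
                    found.append(real)
    return found


if __name__ == "__main__":
    for launcher in find_launchers():
        print(launcher)
```

If this prints more than one path, the mpirun that starts train.py may belong to a different MPI than the one the Python MPI bindings were built against, which would fit the comm.Get_size() == 1 symptom described above.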
Tried that, but it opens up a new box of errors. Interestingly, though, I tried changing the number of cores from 64 to 32 or 24, and I guess the issue only comes up when we use 64 cores (which is odd) |
The number of workers has to be less than the number of cores. How many cores have you got? Try uninstalling Open MPI and installing MPICH instead: sudo apt-get install mpich |
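The traceback in this thread shows train.py calling mpi_fork(args.num_worker+1) and mpirun being started with -np 65, so the default of 64 workers needs 65 processes. A small sketch of the "workers must be fewer than cores" check follows; check_worker_count is a made-up helper, and os.cpu_count() reports logical CPUs, which may differ from the number of slots an MPI launcher is willing to use.

```python
import os


def check_worker_count(num_worker, cores=None):
    """Return True if num_worker workers plus one extra process fit on the cores.

    The "+ 1" mirrors mpi_fork(args.num_worker+1) from the traceback; cores
    defaults to the logical CPU count of this machine (an assumption about
    what the MPI launcher will allow).
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return num_worker + 1 <= cores


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("logical CPUs:", cores)
    print("largest worker count that fits:", cores - 1)
    print("64 workers fit here:", check_worker_count(64))
```

Under this reading, 64 workers would not fit on a machine with exactly 64 cores, while 32 or 24 would, which matches the observation earlier in the thread that smaller worker counts ran.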
I. I've tried the following combinations, which seemed to work (without uninstalling Open MPI):
II. Combinations which did not work include:
Also, I'm using Anaconda 4.2 in all of my experiments because Python 3.6 was causing issues with the Boost libraries. |
Do you get the same problems with the car racing task or is it just doom? |
I don't know about a 64 core proc, but for 24 core
were you able to reproduce this error? I think this issue is caused by an error in any one of the threads while they are executing. When I looked at this carefully, I found that there were different reasons; one time I got this in the middle of a whole screen of text. So I guess this is caused by dependency issues (the same one across all the threads). |
Now I've come across a new error: I created a completely new instance, did the installation as mentioned above, and then executed
I guess all of the other errors were resolved by doing a clean installation in that order. |
I'm not sure how to resolve this, to be honest. The only difference I see is the Python version I used (3.5.2). I ran train.py today on a fresh machine (to check another issue on another thread) for about half a day and it seemed to work: https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt |
@hardmaru thank you. I don't understand why this is happening either; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again. |
Hello, while running train.py I got this error. Can someone help me, please? |