python train.py gives a CalledProcessError #5

kessler-frost · 2018-07-17T05:01:02Z

When I run python train.py on the specified CPU system I get a very long error message ending with,
Traceback (most recent call last):
  File "train.py", line 450, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "train.py", line 424, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134
I searched for the exit status for mpirun but wasn't able to debug the issue.
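
On the exit status itself: by shell convention, a status above 128 usually means the process was killed by signal number (status - 128). On Linux, 134 - 128 = 6, which is SIGABRT, so some process in the job most likely aborted. Whether mpirun follows that convention in this case is an assumption, and signal_from_exit_status is a hypothetical helper name, not part of train.py. A minimal sketch:

```python
import signal

def signal_from_exit_status(status):
    """Return the signal implied by a shell-style exit status, or None.

    Assumes the common convention: status = 128 + signal number.
    """
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128)
    except ValueError:
        # The remainder is not a valid signal number on this platform.
        return None

sig = signal_from_exit_status(134)
print(sig.name if sig else "not a signal-style status")
```

If the result is an abort signal, the real cause is usually printed earlier in the output by whichever process aborted.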

hardmaru · 2018-07-17T05:36:11Z

Have you tried running ESTool with a simple experiment to check that MPI is installed correctly? Also, I think it is configured for a 64-core machine. If you are using fewer cores, pass in a flag to specify the number (instructions are in ESTool or the blog posts).

kessler-frost · 2018-07-17T06:19:34Z

Yes, I've tried running ESTool with a simple experiment from your estool repo using python train.py bullet_racecar -n 8 -t 4, and it ran without any errors. I also tried python train.py bullet_ant -e 16 -n 64 -t 4 after installing pybullet, and that ran successfully too. But I was still unable to do the same with doom. And yes, I am using a 64-core machine with 200GB RAM on gcloud for all of the experiments, just as you mentioned in the blog post.

davidADSP · 2018-07-17T12:23:15Z

Could be related to this:
AppliedDataSciencePartners/WorldModels#3

Ensure you've only got one MPI library on your machine (e.g. try running this if you're on Linux):
sudo apt-get remove openmpi-bin

If you have multiple MPI implementations installed, then comm.Get_size() returns 1, so the following assert statement fails:
num_worker = comm.Get_size()
assert len(packet_list) == num_worker-1
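
To make this failure easier to recognise, the bare assert could be replaced by a check that reports the world size. This is a minimal sketch, not the repository's code: check_world_size is a hypothetical helper, and the world size is passed in as a plain integer so the example runs without MPI.

```python
def check_world_size(world_size, num_packets):
    """Raise a descriptive error when the MPI world size looks wrong.

    world_size  -- the value of comm.Get_size() seen by the master
    num_packets -- the number of work packets prepared (one per worker)
    """
    if world_size == 1:
        raise RuntimeError(
            "MPI world size is 1: every process thinks it is alone. "
            "This can happen when mpirun and mpi4py come from "
            "different MPI installations."
        )
    if num_packets != world_size - 1:
        raise RuntimeError(
            "expected %d packets (one per worker), got %d"
            % (world_size - 1, num_packets)
        )

# In the real script this would be called roughly as:
#   check_world_size(comm.Get_size(), len(packet_list))
check_world_size(65, 64)  # passes silently: 64 workers plus one master
```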

kessler-frost · 2018-07-17T19:24:50Z

Tried that, but it opens up a new set of errors, like
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
or
something like lib12.so was not found.

Interestingly though, when I tried changing the number of workers from 64 to 32 or 24 by executing
python train.py -n 32
it started producing the expected output:
('doomrnn', (1, 35, 269.67, 149.75, 480.81, 69.88, 0.09914, 269.67, 480))

I guess the issue only appears when we use all 64 cores (which is odd).

davidADSP · 2018-07-17T20:53:54Z

The number of workers has to be less than the number of cores. How many cores have you got?

Try uninstalling Open MPI and installing MPICH instead:

sudo apt-get install mpich
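
The traceback at the top of this thread shows train.py calling mpi_fork(args.num_worker+1) and mpirun receiving -np 65, so the job needs one more process than there are workers. That makes the constraint easy to check before launching. A minimal sketch: launch_plan is a hypothetical helper, and the check is advisory, not something train.py is known to enforce.

```python
import os

def launch_plan(num_worker, cores=None):
    """Compare the MPI process count with the available cores.

    Assumes one extra process for the master, as suggested by
    mpi_fork(args.num_worker+1) in the traceback.
    Returns (num_processes, fits_on_cores).
    """
    if cores is None:
        cores = os.cpu_count() or 1
    num_processes = num_worker + 1
    return num_processes, num_processes <= cores

procs, fits = launch_plan(64, cores=64)
if not fits:
    print("%d processes requested; try a smaller -n" % procs)
```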

kessler-frost · 2018-07-18T04:23:36Z

I. Combinations that seemed to work (without uninstalling openmpi):

64-core machine: python train.py -n 24 or python train.py -n 32
24-core machine: python train.py -n 24

II. Combinations that did not work:

With openmpi:
64-core machine: python train.py

With mpich:
64-core machine: python train.py
64-core machine: python train.py -n 32

Also, I'm using Anaconda 4.2 in all of my experiments because Python 3.6 was causing issues with the boost libraries.
I'd suggest that someone who can do a clean installation of all the project dependencies on a 64-core machine try the solution by @davidADSP, as I've exhausted all of my gcloud credits and am stuck with a 24-core machine on a new account.

davidADSP · 2018-07-18T07:32:48Z

Do you get the same problems with the car racing task or is it just doom?

kessler-frost · 2018-07-18T09:09:33Z

I don't know about a 64-core machine, but on 24 cores python train.py -n 24 executes successfully for the car racing task. For a while this issue was also present on the 24-core machine, but I was able to work around it by installing things in this particular order:
pip install tensorflow==1.8 gym==0.9.4 cma==2.2

conda install libgcc

apt-get install -y python-numpy cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip git

pip install mpi4py==2

pip install ppaquette-gym-doom

Were you able to reproduce this error?

I think this issue is triggered by an error in any one of the worker processes while they are executing. When I looked at the output carefully, I found several different underlying errors.
For example, one time I got the same AssertionError you referenced:
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError

Another time I got this in the middle of a whole screen of text:
ImportError: libXft.so.2: cannot open shared object file: No such file or directory

So I guess this is caused by dependency issues (the same one across all the worker processes).
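
Since the real exception is buried in the interleaved output of all the processes, it can help to save the mpirun output to a file and pull out only the exception lines. A sketch under that assumption: exception_lines is a hypothetical helper, and the regular expression only matches lines shaped like Python exception messages.

```python
import re
from collections import Counter

# Matches lines such as "ImportError: ..." or a bare "AssertionError".
_EXC = re.compile(r"^\s*((?:\w+\.)*\w*(?:Error|Exception))(?::\s*(.*))?$")

def exception_lines(log_text):
    """Count the distinct exception lines found in a captured log."""
    counts = Counter()
    for line in log_text.splitlines():
        if _EXC.match(line):
            counts[line.strip()] += 1
    return counts

sample = """\
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
ImportError: libXft.so.2: cannot open shared object file: No such file or directory
ImportError: libXft.so.2: cannot open shared object file: No such file or directory
"""

# Print the most frequent exception lines first.
for text, n in exception_lines(sample).most_common():
    print(n, text)
```

An exception that shows up once per process points at a shared dependency problem, while one that appears only once points at a single failing process.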

kessler-frost · 2018-07-18T13:47:58Z

Now I've come across a new error. I created a completely new instance, did the installation as described above, then executed python train.py, and this occurred:

RuntimeError: can't start new thread

I guess all of the other errors were resolved by doing a clean installation in that order.
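
One possible cause of "can't start new thread" (an assumption, not confirmed anywhere in this thread) is a per-user limit on processes and threads, which many MPI processes each starting their own threads could hit. On Linux the limit can be read with the standard resource module; nproc_limit is a hypothetical helper name.

```python
import resource

def nproc_limit():
    """Return the (soft, hard) per-user process/thread limit on Unix."""
    return resource.getrlimit(resource.RLIMIT_NPROC)

soft, hard = nproc_limit()
unlimited = resource.RLIM_INFINITY
print("soft:", "unlimited" if soft == unlimited else soft)
print("hard:", "unlimited" if hard == unlimited else hard)
```

If the soft limit is low compared with the number of processes launched, raising it (for example with ulimit -u in the launching shell) would be one thing to try.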

hardmaru · 2018-07-21T00:11:09Z

Hi @kessler-frost

I'm not sure how to resolve this to be honest. The only diff I see is the python version I used (3.5.2)

I ran train.py today on a fresh machine (to check another issue on another thread) for ~ half a day and it seemed to work on my machine:

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt

kessler-frost · 2018-07-21T06:34:22Z

@hardmaru thank you. I don't understand why this is happening either; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again.

Antonio-git-lab · 2020-03-26T09:23:11Z

Hello, while I am running train.py I got this error. Can someone help me please?
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 445, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os._exit()
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 419, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 266, in check_call
    retcode = call(*popenargs, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 957, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
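
This FileNotFoundError comes from subprocess failing to find the executable it was asked to start (here "mpirun") on PATH, before any MPI code runs. A quick check with the standard library is sketched below. find_mpi_launcher is a hypothetical helper, and trying mpiexec as a fallback is an assumption: it is the launcher name recommended by the MPI standard, but train.py itself calls mpirun and would need editing to use it.

```python
import shutil

def find_mpi_launcher(candidates=("mpirun", "mpiexec")):
    """Return the full path of the first launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

launcher = find_mpi_launcher()
if launcher is None:
    print("No MPI launcher found on PATH; install MPI or fix PATH.")
else:
    print("Using launcher:", launcher)
```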

kessler-frost closed this as completed Jul 21, 2018

Tar12 mentioned this issue Feb 20, 2019

t1 #25

Closed
