python train.py gives a CalledProcessError #5

Closed
kessler-frost opened this issue Jul 17, 2018 · 12 comments

@kessler-frost

When I run python train.py on the specified CPU system, I get a very long error message ending with:
Traceback (most recent call last):
  File "train.py", line 450, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "train.py", line 424, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134
I searched for the exit status for mpirun but wasn't able to debug the issue.
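
For anyone else trying to decode the exit status: by POSIX shell convention, a status above 128 usually means the process was killed by signal number (status - 128), so 134 would point at signal 6. A minimal sketch, assuming mpirun passes the status through in that convention (the thread does not confirm this); describe_exit_status is a hypothetical helper name, not part of train.py:

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status.

    Assumption: statuses above 128 encode "killed by signal (status - 128)",
    which is the usual shell convention but not guaranteed for every launcher.
    """
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name of the signal number.
            return signal.Signals(signum).name
        except ValueError:
            return "unknown signal {}".format(signum)
    return "exit code {}".format(status)


print(describe_exit_status(134))
```

If that reading holds, one of the child processes aborted, and the real cause is likely somewhere further up in the long error output rather than in this final traceback.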

@hardmaru
Owner

hardmaru commented Jul 17, 2018 via email

@kessler-frost
Author

kessler-frost commented Jul 17, 2018

Yes, I've tried running ESTool with a simple experiment from your estool repo using python train.py bullet_racecar -n 8 -t 4, and it ran without any issue or error. I even tried python train.py bullet_ant -e 16 -n 64 -t 4 after installing pybullet, and it too ran successfully. But I was still unable to do the same with doom. And yes, I am using a 64-core machine with 200GB RAM on gcloud for all of the experiments, just as you mentioned in the blog post.

@davidADSP

davidADSP commented Jul 17, 2018

Could be related to this:
AppliedDataSciencePartners/WorldModels#3

Ensure you've only got one MPI library on your machine (e.g. try running this if you're on Linux):
sudo apt-get remove openmpi-bin

If you have multiple MPI installations, then comm.Get_size() returns 1, so the following assert statement fails:
num_worker = comm.Get_size()
assert len(packet_list) == num_worker-1
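
To make that failure mode concrete, here is a rough sketch of the check in plain Python. It assumes the master expects one process per worker plus itself (train.py launches mpirun with num_worker+1 processes, as the -np 65 in the traceback suggests); check_world_size is a hypothetical helper, not something from the repo:

```python
def check_world_size(world_size, requested_workers):
    """Return a short diagnostic for the MPI world size.

    world_size        -- what comm.Get_size() reports inside a process,
                         e.g. MPI.COMM_WORLD.Get_size() with mpi4py
    requested_workers -- the value passed with -n
    """
    expected = requested_workers + 1  # workers plus one master process
    if world_size == expected:
        return "ok"
    if world_size == 1:
        # Every process believes it is alone in its own MPI world.
        return ("world size is 1: each process thinks it is alone, "
                "which can happen when mpirun and mpi4py come from "
                "different MPI installations")
    return "unexpected world size {} (expected {})".format(world_size, expected)


print(check_world_size(1, 64))
```

With a world size of 1, num_worker-1 is 0 while packet_list still holds one packet per worker, so the assert cannot pass.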

@kessler-frost
Author

Tried that, but it opens up a new set of errors, such as
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
or
a message that lib12.so was not found, or something similar.
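
A quick way to confirm whether any MPI launcher is still on PATH after removing a package is shutil.which from the standard library. This is only a sketch: find_mpi_launcher is a made-up helper, and checking mpiexec as a fallback name is my assumption (train.py itself calls mpirun):

```python
import shutil


def find_mpi_launcher(candidates=("mpirun", "mpiexec")):
    """Return (name, path) of the first launcher found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # None when the executable is not on PATH
        if path is not None:
            return name, path
    return None


launcher = find_mpi_launcher()
if launcher is None:
    print("No MPI launcher found on PATH; install one (e.g. mpich) first.")
else:
    print("Using {} at {}".format(*launcher))
```

If this prints that no launcher was found, subprocess.check_call(["mpirun", ...]) will raise the FileNotFoundError shown above.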

Interestingly though, when I tried changing the number of cores used from 64 to 32 or 24 by executing
python train.py -n 32
it started producing the expected output:
('doomrnn', (1, 35, 269.67, 149.75, 480.81, 69.88, 0.09914, 269.67, 480))

I guess the issue only occurs when we use all 64 cores (which is odd).

@davidADSP

The number of workers has to be less than the number of cores. How many cores have you got?

Try uninstalling Open MPI and installing mpich instead:

sudo apt-get install mpich
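
The rule of thumb above can be checked before launching. A minimal sketch, assuming the rule is simply "workers < cores"; workers_fit is a hypothetical helper. Treat it as a heuristic, since later in this thread -n 24 on a 24-core machine is reported to work:

```python
import os


def workers_fit(num_worker, cores=None):
    """Apply the rule of thumb: workers should be fewer than cores."""
    if cores is None:
        # os.cpu_count() can return None on unusual platforms.
        cores = os.cpu_count() or 1
    return num_worker < cores


for n in (24, 32, 64):
    print(n, "workers on 64 cores:", workers_fit(n, cores=64))
```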

@kessler-frost
Author

kessler-frost commented Jul 18, 2018

I. Combinations that seemed to work (without uninstalling openmpi):

  1. 64 Core proc, python train.py -n 24 or python train.py -n 32
  2. 24 Core proc, python train.py -n 24

II. Combinations that did not work:

with openmpi -

  1. 64 Core proc, python train.py

with mpich -

  1. 64 Core proc, python train.py
  2. 64 Core proc, python train.py -n 32

Also, I'm using Anaconda 4.2 in all of my experiments because Python 3.6 was causing issues with boost libraries.
If someone can perform a clean installation of all the project dependencies on a 64-core machine, I'd suggest they try the solution by @davidADSP; I've exhausted all of my gcloud credits and am stuck with a 24-core machine on a new account.

@davidADSP

Do you get the same problems with the car racing task or is it just doom?

@kessler-frost
Author

kessler-frost commented Jul 18, 2018

I don't know about a 64-core machine, but on a 24-core machine python train.py -n 24 runs successfully for the car racing task. For a while this issue also occurred on the 24-core machine, but I was able to work around it by installing things in this particular order:
pip install tensorflow==1.8 gym==0.9.4 cma==2.2

conda install libgcc

apt-get install -y python-numpy cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip git

pip install mpi4py==2

pip install ppaquette-gym-doom
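
To double-check that an environment actually ended up with the pinned versions above, something like the following could be used. It is a sketch under a few assumptions: it needs Python 3.8+ for importlib.metadata (newer than the 3.5 used in this thread), check_pins is a hypothetical helper, and the prefix match is looser than pip's ==:

```python
from importlib import metadata  # standard library from Python 3.8

# Pins taken from the pip commands in this comment.
PINNED = {"tensorflow": "1.8", "gym": "0.9.4", "cma": "2.2", "mpi4py": "2"}


def check_pins(pins):
    """Compare installed package versions against the wanted ones."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed (wanted {})".format(wanted)
            continue
        # Loose match: "1.8" accepts "1.8" and "1.8.x".
        if found == wanted or found.startswith(wanted + "."):
            report[name] = "ok ({})".format(found)
        else:
            report[name] = "mismatch: found {}, wanted {}".format(found, wanted)
    return report


for package, status in check_pins(PINNED).items():
    print(package, "->", status)
```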

Were you able to reproduce this error?

I think this issue is triggered by an error in any one of the worker processes while they are executing. When I looked carefully, I found that the underlying reason differed from run to run.
For example, one time I got the same AssertionError you referenced:
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError

Another time I got this in the middle of a whole screen of text:
ImportError: libXft.so.2: cannot open shared object file: No such file or directory

So I guess this is caused by dependency issues (the same one across all the workers).
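
Since the real error tends to be buried in a screen of interleaved output, it can help to save the output to a file and count the distinct exception lines. A rough sketch: summarize_errors and the regular expression are my own, and they only catch lines that look like the final line of a Python traceback:

```python
import re
from collections import Counter

# Matches lines such as "AssertionError" or "ImportError: some message".
ERROR_LINE = re.compile(r"^(?:\w+\.)*\w*(?:Error|Exception)(?::\s.*)?$")


def summarize_errors(log_text):
    """Count each distinct exception line found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if ERROR_LINE.match(line):
            counts[line] += 1
    return counts


sample = """\
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
ImportError: libXft.so.2: cannot open shared object file: No such file or directory
AssertionError
"""
for message, count in summarize_errors(sample).most_common():
    print(count, message)
```

If the same message shows up once per worker, that would support the dependency theory; several different messages would suggest more than one problem.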

@kessler-frost
Author

kessler-frost commented Jul 18, 2018

Now I've come across a new error. I created a completely new instance, did the installation as described above, then executed python train.py, and this occurred:

RuntimeError: can't start new thread

I guess all of the other errors were resolved by doing a clean installation in that order.
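
One thing worth ruling out for "can't start new thread" is the per-user process/thread limit, which is a common cause of this error on Linux, though nothing in this thread confirms it is the cause here. A minimal sketch using the standard resource module (Unix only); the helper names are made up:

```python
import resource


def nproc_limit():
    """Return the (soft, hard) per-user process limit.

    On Linux this limit also counts threads.
    """
    return resource.getrlimit(resource.RLIMIT_NPROC)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nproc_limit()
print("max user processes: soft={}, hard={}".format(describe(soft), describe(hard)))
```

With 65 processes each starting their own threads, a low soft limit could plausibly be exhausted, so comparing it against the total thread count is a cheap first check.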

@hardmaru
Owner

Hi @kessler-frost

I'm not sure how to resolve this to be honest. The only diff I see is the python version I used (3.5.2)

I ran train.py today on a fresh machine (to check another issue on another thread) for ~ half a day and it seemed to work on my machine:

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt

@kessler-frost
Author

@hardmaru thank you. I don't understand why this is happening either; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again.

@Tar12 mentioned this issue Feb 20, 2019
@Antonio-git-lab

Hello, while running train.py I got this error. Can someone help me, please?
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 445, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os._exit()
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 419, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 266, in check_call
    retcode = call(*popenargs, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 957, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
