
Broken pipe when training a model on CPU #42

Closed
oceank opened this issue Aug 30, 2020 · 11 comments

@oceank

oceank commented Aug 30, 2020

Hi,

I followed the instructions in README.md to train an A2C agent in the DoorKey environment using the following command (Python 3.7.3) on Ubuntu 18.04 with 8 CPUs.

python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000

The training went well initially but ended with a BrokenPipeError exception that crashed the training process. The error message is copied below. According to scripts/train.py, the above command runs with 16 processes. Initially, I thought the error occurred because the training spawned too many processes, but even with --procs=6 the same exception happened. Only with --procs=1 did the training run successfully. Is there any special setting I need to enable multi-process training?
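For reference, these are the variants I tried (only the --procs flag changes):

```sh
# crashes with BrokenPipeError at the end of training
python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000 --procs 6

# completes without error
python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000 --procs 1
```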

(I just realized that the error originates in torch_ac.)

Error Message

Exception ignored in: <function ParallelEnv.__del__ at 0x7f2df3411a60>
Traceback (most recent call last):
  File "~/torch-ac/torch_ac/utils/penv.py", line 41, in __del__
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 206, in send
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
BrokenPipeError: [Errno 32] Broken pipe
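For anyone unfamiliar with the failure mode, here is a minimal standalone sketch (not torch_ac code) that reproduces the same exception: the parent sends a message over a multiprocessing pipe whose other end belonged to a worker process that has already exited.

```python
import multiprocessing as mp

def worker(conn):
    conn.close()  # the worker exits immediately, closing its end of the pipe

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    p = mp.Process(target=worker, args=(child_end,))
    p.start()
    child_end.close()  # drop the parent's copy of the child end
    p.join()           # the worker is gone now
    parent_end.send("terminate")  # BrokenPipeError: [Errno 32] Broken pipe
```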
@lcswillems
Owner

lcswillems commented Aug 31, 2020

Thank you @oceank for raising this issue!!

However, I can't reproduce it...
Could you tell me at which point in the training it fails? Is it always at the same point? It seems there is an issue with the parallelization.
Could you also tell me which OS you use, which version of Python you have, and whether you run it on GPU?

I also don't have time currently to investigate this issue in depth. So if the error persists, I would advise you to try another library.

@oceank
Author

oceank commented Aug 31, 2020

Hi, @lcswillems,

I ran the code on Ubuntu 18.04 with Python 3.7.3, without a GPU. I cannot tell yet where in the training the error was triggered. I will check it out.

@vekt0r-github

Hi,

Just adding that I reproduced this with the same command on Ubuntu 18.04.4, Python 3.8.5 without GPU; I believe the broken pipe happens right at the end of training, as the output right before the exception is

U 40 | F 081920 | FPS 0362 | D 294 | rR:μσmM 0.93 0.03 0.81 0.97 | F:μσmM 20.8 8.5 8.0 52.0 | H 1.335 | V 0.807 | pL -0.016 | vL 0.002 | ∇ 0.035
Status saved

which is already past the requested 80000 frames. I don't know much about this, but it might suggest that this is only a minor bug occurring at shutdown.

@lcswillems
Owner

I am sorry, but I can't reproduce the bug... I have tried @oceank's command with no problem on my side.
If somebody could share the exact command that fails for them, along with their configuration, that would be great!

@bharatprakash

I recently started getting this error too, and it happens right at the end of training.
I didn't get this error before, so I am not sure what the problem is.

@lcswillems
Owner

@bharatprakash What do you mean by "recently"? Did it start yesterday?

@bharatprakash

bharatprakash commented Oct 20, 2020

@lcswillems Sorry, I should have been clearer.
I have a copy of this repo (and torch-ac) that I cloned a few months ago, and it works fine.

Last week I cloned both repos again on the same server for a different experiment I'm doing, and now I see this error.

There is a new commit on torch-ac, and that is where @oceank and I see the error:
lcswillems/torch-ac@64833c6

@lcswillems
Owner

Thank you for these details! It could be related to the commit you linked, but I am not able to reproduce the issue.

Do you have a way to reproduce it? If so, could you check out the commit just before lcswillems/torch-ac@64833c6 and tell me if you still get the error?
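For example, something like this should work (assuming a local clone of torch-ac):

```sh
cd torch-ac
git checkout 64833c6^   # check out the parent of commit 64833c6
pip install -e .        # (re)install this checkout so the scripts pick it up
```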

@oceank
Author

oceank commented Oct 26, 2020

@lcswillems The commit before lcswillems/torch-ac@64833c6 works well. The broken pipe error is gone in my local run. Thanks for the help.

@lcswillems
Owner

lcswillems commented Nov 1, 2020

@oceank I reverted the commit. Could you tell me if the latest version of torch-ac works fine for you?
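In case it helps anyone reading along, a common defensive pattern for destructor-time cleanup is to tolerate workers that have already exited. This is only an illustrative sketch, not the actual torch-ac code; the class and attribute names are hypothetical:

```python
class ParallelEnvSketch:
    """Illustrative only: a vectorized-env wrapper that shuts workers down."""

    def __init__(self, worker_conns):
        self.worker_conns = worker_conns  # one multiprocessing Connection per worker

    def __del__(self):
        # Ask every worker to shut down, but tolerate workers that are
        # already gone; otherwise __del__ raises BrokenPipeError at exit.
        for conn in self.worker_conns:
            try:
                conn.send(("terminate", None))
            except BrokenPipeError:
                pass  # the worker process already exited; nothing to do
```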

@lcswillems
Owner

I am closing this issue because I think it is fixed. @oceank, if it isn't, please tell me and I will reopen it.
