
Broken pipe when training a model on CPU #42

Closed
oceank opened this issue Aug 30, 2020 · 11 comments

@oceank

oceank commented Aug 30, 2020

Hi,

I followed the instructions in README.md to train an A2C agent in the DoorKey environment using the following command (Python 3.7.3) on Ubuntu 18.04 with 8 CPUs.

python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000

The training went well initially but ended with a BrokenPipeError exception that crashed the training process. The error message is copied below. According to scripts/train.py, the above command runs with 16 processes. Initially, I thought the error occurred because the training spawned too many processes, but even with --procs=6 the same exception happened. Only with --procs=1 did the training run successfully. Is there any special setting I need to enable multi-process training?
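For reference, these are the variants I tried (only the --procs flag changes):

```sh
# crashes with BrokenPipeError at the end of training
python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000 --procs 6

# completes without error
python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000 --procs 1
```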

(I just realized that the error originates in torch_ac.)

Error Message

Exception ignored in: <function ParallelEnv.__del__ at 0x7f2df3411a60>
Traceback (most recent call last):
  File "~/torch-ac/torch_ac/utils/penv.py", line 41, in __del__
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 206, in send
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
  File "~/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
BrokenPipeError: [Errno 32] Broken pipe
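For anyone unfamiliar with the failure mode, here is a minimal standalone sketch (not torch_ac code) that reproduces the same exception: the parent sends a message over a multiprocessing pipe whose other end belonged to a worker process that has already exited.

```python
import multiprocessing as mp

def worker(conn):
    conn.close()  # the worker exits immediately, closing its end of the pipe

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    p = mp.Process(target=worker, args=(child_end,))
    p.start()
    child_end.close()  # drop the parent's copy of the child end
    p.join()           # the worker is gone now
    parent_end.send("terminate")  # BrokenPipeError: [Errno 32] Broken pipe
```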
@lcswillems
Owner

lcswillems commented Aug 31, 2020

Thank you @oceank for raising this issue!!

However, I can't reproduce it...
Could you tell me at which point in the training it fails? Is it always at the same point? It seems there is an issue with the parallelization.
Could you also tell me which OS you use, which version of Python you have, and whether you run it on GPU?

I also don't have time currently to investigate this issue in depth. So if the error persists, I would advise you to try another library.

@oceank
Author

oceank commented Aug 31, 2020

Hi, @lcswillems,

I ran the code on Ubuntu 18.04 with Python 3.7.3, without a GPU. I cannot tell yet where in the training the error was triggered. I will check it out.

@vekt0r-github

Hi,

Just adding that I reproduced this with the same command on Ubuntu 18.04.4, Python 3.8.5 without GPU; I believe the broken pipe happens right at the end of training, as the output right before the exception is

U 40 | F 081920 | FPS 0362 | D 294 | rR:μσmM 0.93 0.03 0.81 0.97 | F:μσmM 20.8 8.5 8.0 52.0 | H 1.335 | V 0.807 | pL -0.016 | vL 0.002 | ∇ 0.035
Status saved

which is already past the requested 80000 frames. I don't know much about this, but it might suggest that this is only a minor bug occurring at shutdown.

@lcswillems
Owner

I am sorry, but I can't reproduce the bug... I have tried @oceank's command with no problem on my side.
If somebody could share the exact command that fails for them, along with their configuration, that would be great!

@bharatprakash

I recently started getting this error too, and it happens right at the end of training.
I didn't get this error before, so I am not sure what the problem is.

@lcswillems
Owner

@bharatprakash What do you mean by "recently"? Did it start yesterday?

@bharatprakash

bharatprakash commented Oct 20, 2020

@lcswillems Sorry, I should have been clearer.
I have a copy of this repo (and torch-ac) that I cloned a few months ago, and it works fine.

Last week I cloned both repos again on the same server for a different experiment I'm doing, and now I see this error.

There is a new commit on torch-ac, and that is where @oceank and I see the error:
lcswillems/torch-ac@64833c6

@lcswillems
Owner

Thank you for these details! It could be related to the commit you linked, but I am not able to reproduce the issue.

Do you have a way to reproduce it? If so, could you check out the commit just before lcswillems/torch-ac@64833c6 and tell me if you still get the error?
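For example, something like this should work (assuming a local clone of torch-ac):

```sh
cd torch-ac
git checkout 64833c6^   # check out the parent of commit 64833c6
pip install -e .        # (re)install this checkout so the scripts pick it up
```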

@oceank
Author

oceank commented Oct 26, 2020

@lcswillems The commit before lcswillems/torch-ac@64833c6 works well. The broken pipe error is gone in my local run. Thanks for the help.

@lcswillems
Owner

lcswillems commented Nov 1, 2020

@oceank I reverted the commit. Could you tell me if the latest version of torch-ac works fine for you?
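In case it helps anyone reading along, a common defensive pattern for destructor-time cleanup is to tolerate workers that have already exited. This is only an illustrative sketch, not the actual torch-ac code; the class and attribute names are hypothetical:

```python
class ParallelEnvSketch:
    """Illustrative only: a vectorized-env wrapper that shuts workers down."""

    def __init__(self, worker_conns):
        self.worker_conns = worker_conns  # one multiprocessing Connection per worker

    def __del__(self):
        # Ask every worker to shut down, but tolerate workers that are
        # already gone; otherwise __del__ raises BrokenPipeError at exit.
        for conn in self.worker_conns:
            try:
                conn.send(("terminate", None))
            except BrokenPipeError:
                pass  # the worker process already exited; nothing to do
```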

@lcswillems
Owner

I am closing this issue because I think it is fixed. @oceank, if it isn't, please tell me and I will reopen it.
