Broken pipe when training a model on CPU #42
Thank you @oceank for raising this issue! However, I can't reproduce it, and I don't currently have time to investigate it in depth. So if the error persists, I would advise you to try another library.
Hi @lcswillems, I ran the code on Ubuntu 18.04 with Python 3.7.3 and no GPU. I cannot tell yet where in the training the error was triggered. I will check it out.
Hi, just adding that I reproduced this with the same command on Ubuntu 18.04.4, Python 3.8.5, without GPU. I believe the broken pipe happens right at the end of training, as the output right before the exception is
which is over 80000 frames. I don't know much about this, but it might suggest that this is a minor bug (?).
I am sorry, but I can't reproduce the bug... I have tried @oceank's command, but there was no problem for me.
I recently started getting this error. And it happens right at the end of training. |
@bharatprakash What do you mean by "recently"? Did it start yesterday? |
@lcswillems Sorry, I should have been more clear. I cloned this repo (and torch-ac) again last week on the same server for a different experiment I'm doing, and now I see this error. There is a new commit on torch-ac, and that's where @oceank and I see the error.
Thank you for these details! It could be related to the commit you link, but I am not able to reproduce the issue. Do you have a reliable way to reproduce it? If so, could you go to the commit before this one, lcswillems/torch-ac@64833c6, and tell me if you still get the error?
@lcswillems lcswillems/torch-ac@64833c6 works well. The broken pipe error is gone in my local run. Thanks for the help. |
@oceank I reverted the commit. Could you tell me if the latest version of torch-ac works fine for you? |
I am closing this issue because I think I fixed it. @oceank, if I didn't, please tell me and I will reopen it.
Hi,
I followed the instructions in README.md to train an A2C agent in the DoorKey environment using the following command (Python 3.7.3) on Ubuntu 18.04 with 8 CPUs.
python scripts/train.py --algo a2c --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 80000
The training went well initially but ended with a BrokenPipeError exception that crashed the training process. The error message is copied below. According to scripts/train.py, the above command runs with 16 processes. Initially, I thought the error occurred because the training initialized too many processes, but even with --procs=6 the same exception happened again. Only with --procs=1 did the training run successfully. Is there any special setting needed to enable training with multiple processes?
(Just realized that the error roots in torch_ac)
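For context, a BrokenPipeError at the very end of training is consistent with the main process exiting (or a worker closing its end of a pipe) while the other side still tries to send or receive. Below is a minimal, self-contained sketch of the pipe-per-worker pattern that multi-process rollout collection typically uses; the worker and helper names are hypothetical and this is not torch-ac's actual code. The point it illustrates is the orderly shutdown (send a sentinel, then join each worker) whose absence is one common way to hit a broken pipe at exit.

```python
import multiprocessing as mp

def worker(conn):
    # Echo doubled payloads until the parent sends a shutdown sentinel (None).
    while True:
        msg = conn.recv()
        if msg is None:
            conn.close()
            break
        conn.send(msg * 2)

def run_workers(num_procs, payload):
    # One duplex pipe per worker process, similar in spirit to how a
    # vectorized environment dispatches calls to subprocesses.
    parents, procs = [], []
    for _ in range(num_procs):
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=worker, args=(child_conn,))
        p.start()
        parents.append(parent_conn)
        procs.append(p)

    results = []
    for conn in parents:
        conn.send(payload)
        results.append(conn.recv())

    # Orderly shutdown: tell each worker to exit, then join it.
    # If the parent instead exits while workers are still blocked on the
    # pipe, either side can raise BrokenPipeError.
    for conn, p in zip(parents, procs):
        conn.send(None)
        p.join()
    return results

if __name__ == "__main__":
    print(run_workers(4, 21))  # [42, 42, 42, 42]
```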
Error Message