When I turned on TensorBoard with the actor-learner mode in `train_dqn_gym.py`, the program froze after the first evaluation. The repro steps and my analysis are summarized below.
Reproduction
I ran into the following problem with this commit, the latest master as of Nov. 3rd, 2020.
Steps to reproduce
1. In `examples/gym/train_dqn_gym.py`, add `use_tensorboard=True` as an argument of `train_agent_async()` (here); see the sketch after these steps.
2. Run `python examples/gym/train_dqn_gym.py --actor-learner`
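For concreteness, a rough sketch of the change in step 1. The surrounding argument names are illustrative placeholders and may not match the actual call in the script; only `use_tensorboard=True` is the change needed to reproduce the hang.

```python
# Hypothetical sketch of the edit to examples/gym/train_dqn_gym.py.
# All arguments except use_tensorboard are placeholders.
from pfrl import experiments

experiments.train_agent_async(
    agent=agent,
    outdir=args.outdir,
    processes=args.processes,
    make_env=make_env,
    steps=args.steps,
    eval_n_steps=None,
    eval_n_episodes=args.eval_n_runs,
    eval_interval=args.eval_interval,
    use_tensorboard=True,  # added: enables SummaryWriter logging
)
```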
Result
The actor process hangs during the first evaluation, after emitting the following log.
Expected behavior
The actor process keeps running without hanging.
Analysis
The actor process stops here, during `summary_writer.add_scalar`, where TensorBoard's `SummaryWriter` appears to deadlock.
I suspect this happens because the `_AsyncWriterThread` used internally by `SummaryWriter` does not run in the actor processes. Actor processes are forked from the root process with a copy of the `SummaryWriter`, but on POSIX systems fork only duplicates the calling thread, so the associated background threads, including the `_AsyncWriterThread`, do not exist in the children. Consequently, the writer's bounded queue is never consumed; once it reaches full capacity, each actor blocks while trying to add a new scalar and gets stuck there.
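To make the suspected mechanism concrete, here is a minimal, self-contained sketch (not using TensorBoard or pfrl) of the same pattern: a bounded queue whose consumer thread exists only in the parent, so the forked child's puts eventually block.

```python
# Minimal sketch of the suspected failure mode: the thread that drains a
# bounded queue is not inherited by a forked child, so the child's copy of
# the queue fills up and put() blocks.
import os
import queue
import threading

q = queue.Queue(maxsize=2)

def drain():
    while True:
        q.get()  # consumer thread exists only in the parent after fork

threading.Thread(target=drain, daemon=True).start()

pid = os.fork()
if pid == 0:
    # Child process: the drain thread was not copied by fork, so once the
    # queue is full, put() would block forever -- analogous to an actor
    # hanging inside summary_writer.add_scalar.
    for i in range(5):
        try:
            q.put(i, timeout=1.0)
            print(f"child: put {i}", flush=True)
        except queue.Full:
            print(f"child: blocked at item {i}; queue is never drained", flush=True)
            break
    os._exit(0)
else:
    os.waitpid(pid, 0)
```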