Actor processes hang in train_agent_async when use_tensorboard=True #88

Closed

g-votte opened this issue Nov 4, 2020 · 0 comments

g-votte (Contributor) commented Nov 4, 2020

When I turned on TensorBoard with the actor-learner mode in train_dqn_gym.py, the program froze after the first evaluation. The repro steps and my analysis of the problem are summarized below.

Reproduction

I faced the following problem at this commit, the latest master as of Nov. 3rd, 2020.

Steps to reproduce

  1. In examples/gym/train_dqn_gym.py, add use_tensorboard=True as an argument of train_agent_async() (here); a sketch of this change is shown after these steps
  2. Run python examples/gym/train_dqn_gym.py --actor-learner
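
For reference, step 1 amounts to something like the sketch below. Only use_tensorboard=True is the actual change; the other keyword arguments are placeholders for the existing call in the example and may not match it exactly.

    # Hypothetical sketch of step 1 (not the exact call in train_dqn_gym.py):
    # add use_tensorboard=True to the existing train_agent_async() call.
    experiments.train_agent_async(
        agent=agent,
        outdir=args.outdir,
        processes=args.num_envs,
        make_env=make_env,
        steps=args.steps,
        eval_interval=args.eval_interval,
        use_tensorboard=True,  # newly added argument that triggers the hang
        # ...other existing arguments left unchanged...
    )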

Result

The actor process hangs during the first round of evaluation, after printing the following log.

...
INFO:pfrl.experiments.train_agent_async:evaluation episode 96 length:200 R:-1494.8058766440454
INFO:pfrl.experiments.train_agent_async:evaluation episode 97 length:200 R:-1592.9273165459317
INFO:pfrl.experiments.train_agent_async:evaluation episode 98 length:200 R:-1533.3344787068036
INFO:pfrl.experiments.train_agent_async:evaluation episode 99 length:200 R:-1570.1153000497297

Expected behavior

The actor process continues to run without hanging.

Analysis

The actor process stops here, during summary_writer.add_scalar, where TensorBoard's SummaryWriter seems to deadlock.

I suspect that this problem happens because the _AsyncWriterThread used internally by SummaryWriter does not exist in the actor processes. Actor processes are forked from the root process with a copy of the SummaryWriter, but the associated threads, including the _AsyncWriterThread, are not copied on POSIX-based systems. Consequently, the writer's queue is never consumed and fills to capacity. Once full, it blocks each actor from adding a new scalar to TensorBoard, and the actor gets stuck there.
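
As a rough illustration of the mechanism (a standalone sketch, not PFRL or TensorBoard code): a background consumer thread is not inherited by a forked child process, so a bounded queue written to by the child eventually blocks on put(), just like the writer's event queue. The script below hangs by design to mirror the stuck actor.

    import os
    import queue
    import threading

    q = queue.Queue(maxsize=5)  # stands in for the writer's bounded event queue

    def consumer():
        # Drains the queue, playing the role of _AsyncWriterThread.
        while True:
            q.get()

    threading.Thread(target=consumer, daemon=True).start()

    pid = os.fork()
    if pid == 0:
        # Child (the "actor"): the consumer thread was not copied by fork(),
        # so nothing drains this copy of the queue; once maxsize items are
        # queued, the next put() blocks forever.
        for i in range(10):
            q.put(i)
            print("child put", i, flush=True)
        os._exit(0)
    else:
        os.waitpid(pid, 0)  # never returns, mirroring the stuck actor process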
