Actor processes hang in train_agent_async when use_tensorboard=True #88

Closed

g-votte opened this issue Nov 4, 2020 · 0 comments

g-votte (Contributor) commented Nov 4, 2020

When I turned on TensorBoard with the actor-learner mode in train_dqn_gym.py, the program froze after the first evaluation. The repro steps and my analysis of the problem are summarized below.

Reproduction

I faced the following problem at this commit, the latest master as of Nov. 3rd, 2020.

Steps to reproduce

  1. In examples/gym/train_dqn_gym.py, add use_tensorboard=True as an argument of train_agent_async() (here); a sketch of this change is shown after these steps
  2. Run python examples/gym/train_dqn_gym.py --actor-learner
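
For reference, step 1 amounts to something like the sketch below. Only use_tensorboard=True is the actual change; the other keyword arguments are placeholders for the existing call in the example and may not match it exactly.

    # Hypothetical sketch of step 1 (not the exact call in train_dqn_gym.py):
    # add use_tensorboard=True to the existing train_agent_async() call.
    experiments.train_agent_async(
        agent=agent,
        outdir=args.outdir,
        processes=args.num_envs,
        make_env=make_env,
        steps=args.steps,
        eval_interval=args.eval_interval,
        use_tensorboard=True,  # newly added argument that triggers the hang
        # ...other existing arguments left unchanged...
    )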

Result

The actor process hangs during the first round of evaluation, after printing the following log.

...
INFO:pfrl.experiments.train_agent_async:evaluation episode 96 length:200 R:-1494.8058766440454
INFO:pfrl.experiments.train_agent_async:evaluation episode 97 length:200 R:-1592.9273165459317
INFO:pfrl.experiments.train_agent_async:evaluation episode 98 length:200 R:-1533.3344787068036
INFO:pfrl.experiments.train_agent_async:evaluation episode 99 length:200 R:-1570.1153000497297

Expected behavior

The actor process continues to run without hanging.

Analysis

The actor process stops here, during summary_writer.add_scalar, where TensorBoard's SummaryWriter seems to deadlock.

I suspect that this problem happens because the _AsyncWriterThread used internally by SummaryWriter does not exist in the actor processes. Actor processes are forked from the root process with a copy of the SummaryWriter, but the associated threads, including the _AsyncWriterThread, are not copied on POSIX-based systems. Consequently, the writer's queue is never consumed and fills to capacity. Once full, it blocks each actor from adding a new scalar to TensorBoard, and the actor gets stuck there.
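
As a rough illustration of the mechanism (a standalone sketch, not PFRL or TensorBoard code): a background consumer thread is not inherited by a forked child process, so a bounded queue written to by the child eventually blocks on put(), just like the writer's event queue. The script below hangs by design to mirror the stuck actor.

    import os
    import queue
    import threading

    q = queue.Queue(maxsize=5)  # stands in for the writer's bounded event queue

    def consumer():
        # Drains the queue, playing the role of _AsyncWriterThread.
        while True:
            q.get()

    threading.Thread(target=consumer, daemon=True).start()

    pid = os.fork()
    if pid == 0:
        # Child (the "actor"): the consumer thread was not copied by fork(),
        # so nothing drains this copy of the queue; once maxsize items are
        # queued, the next put() blocks forever.
        for i in range(10):
            q.put(i)
            print("child put", i, flush=True)
        os._exit(0)
    else:
        os.waitpid(pid, 0)  # never returns, mirroring the stuck actor process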
