
Should TensorboardWriter close its tf.summary.FileWriter? #855

Open
shwang opened this issue May 14, 2020 · 4 comments
Labels: bug, help wanted

Comments


shwang commented May 14, 2020

PPO2 uses a with TensorboardWriter(...) as writer: context that flushes but never closes its tf.summary.FileWriter. In combination with another problem on my side, this led to a "too many files are opened by this process" error in one of my runs when I called PPO2.learn() repeatedly.

Maybe the intention here is to let us access the same FileWriter later, but a second call to PPO2.learn() in fact opens a new events file and creates a new FileWriter, which again is not closed by the time learn() exits.
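
For context, here is a rough sketch of how the build-up happens (untested; the environment, policy, and timestep counts are placeholders):

from stable_baselines import PPO2

model = PPO2("MlpPolicy", "CartPole-v1", tensorboard_log="./sb_tb/")
for _ in range(10):
    # Each call enters a fresh TensorboardWriter context, which opens a new
    # tf.summary.FileWriter (and a new events file) that is flushed but never closed.
    model.learn(total_timesteps=1000, reset_num_timesteps=False)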

Relevant lines in TensorboardWriter:

def __enter__(self):
    if self.tensorboard_log_path is not None:
        latest_run_id = self._get_latest_run_id()
        if self.new_tb_log:
            latest_run_id = latest_run_id + 1
        save_path = os.path.join(self.tensorboard_log_path,
                                 "{}_{}".format(self.tb_log_name, latest_run_id))
        self.writer = tf.summary.FileWriter(save_path, graph=self.graph)
    return self.writer

def __exit__(self, exc_type, exc_val, exc_tb):
    if self.writer is not None:
        self.writer.add_graph(self.graph)
        self.writer.flush()
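
One possible fix, sketched below (not what stable-baselines currently does, and it assumes the writer is not meant to be reused after the context exits), would be to close the writer in __exit__:

def __exit__(self, exc_type, exc_val, exc_tb):
    if self.writer is not None:
        self.writer.add_graph(self.graph)
        self.writer.flush()
        # close() releases the underlying events-file handle; flush() alone keeps it open
        self.writer.close()
        self.writer = None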


shwang commented May 14, 2020

Maybe the context flushes instead of closing because the intent is to reuse the old TensorBoard FileWriter when possible.

That way we wouldn't create a new FileWriter, and therefore a new events file, every time we call PPO2.learn(reset_num_timesteps=False). (A sketch of what I mean is at the end of this comment.)

I'm ending up with a long and growing list of files like:

├── sb_tb
│   └── PPO2_1
│       ├── events.out.tfevents.1589433242.spinach
│       ├── events.out.tfevents.1589433245.spinach
│       ├── events.out.tfevents.1589433248.spinach
│       ├── events.out.tfevents.1589433250.spinach
│       ├── events.out.tfevents.1589433253.spinach
│       ├── events.out.tfevents.1589433255.spinach
│       ├── events.out.tfevents.1589433257.spinach
│       ├── events.out.tfevents.1589433260.spinach
│       ├── events.out.tfevents.1589433262.spinach
│       └── events.out.tfevents.1589433265.spinach

Granted, I can just rely on the episode reward mean logs from Monitor and logger.logkv(), which don't use this TensorboardWriter context, so it's not at all critical for me to use it.
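
For illustration, the kind of reuse I have in mind could look roughly like this (only a sketch; the class-level _writer_cache attribute is hypothetical and not part of stable-baselines):

class TensorboardWriter(object):
    # Hypothetical class-level cache mapping run directory -> FileWriter, so that
    # repeated learn() calls resolving to the same run reuse one events file.
    _writer_cache = {}

    def __enter__(self):
        if self.tensorboard_log_path is not None:
            latest_run_id = self._get_latest_run_id()
            if self.new_tb_log:
                latest_run_id = latest_run_id + 1
            save_path = os.path.join(self.tensorboard_log_path,
                                     "{}_{}".format(self.tb_log_name, latest_run_id))
            if save_path not in TensorboardWriter._writer_cache:
                TensorboardWriter._writer_cache[save_path] = tf.summary.FileWriter(
                    save_path, graph=self.graph)
            self.writer = TensorboardWriter._writer_cache[save_path]
        return self.writer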

araffin added the bug label on May 14, 2020

araffin commented May 14, 2020

Hello,

Maybe a duplicate of #501, but it really sounds like a bug.

@Jiankai-Sun

Does setting new_tb_log=False here not work?

araffin added the help wanted label on May 20, 2020

araffin commented May 20, 2020

Does setting new_tb_log=False here not work?

There is an issue about that: #599 (comment)
