
Training process gets killed due to OOM #82

Closed
aceofgreens opened this issue Aug 24, 2023 · 1 comment
Labels: config (New or improved configuration), good first issue (Good for newcomers)

Comments


aceofgreens commented Aug 24, 2023

Summary of issue

The training process gets killed by the kernel; dmesg shows the reason as "out of memory".

Model: MuZero with self-supervision
Environment: Pong
The architecture is exactly the same as the default one for Atari envs, except that:

  • I am using RGB instead of grayscale (so input to the model is (B, 12, 96, 96) with 4 stacked frames)
  • I am using a few additional layers in the representation network

The process gets killed after 40k training iterations (a bit more than 500k environment steps). The Buffer/memory_usage/process log shows that the memory used by the process starts from 0 and grows slightly faster than linearly up to about 6e+4, at which point the process is killed.

NOTE: I have been able to reproduce the "Quick Start" training run on Pong with the default config. No issue there.

General questions:

  1. Why does the memory used by the process seem to always increase? Is it the replay buffer?
  2. Is there a way to control the memory used from any of the config settings, so that the process does not get killed?
@puyuan1996 added the good first issue (Good for newcomers) and config (New or improved configuration) labels on Aug 28, 2023
@puyuan1996 (Collaborator) commented

Hello,

Thank you for your attention and inquiry.

  1. The steady increase in process memory is primarily due to the continuous addition of collected data into the replay buffer, so the buffer's memory footprint keeps growing. A rough estimate (see the sketch after this list): 12 * 96 * 96 bytes per stored observation * 1e6 / 1024^3 ≈ 103 GB, where 1e6 is the default capacity of the replay buffer.

  2. You can indeed control memory usage through the configuration settings (a sketch of the relevant fields follows below).

    • Firstly, we recommend using grayscale images, which cuts observation memory roughly by a factor of 3 (4 stacked grayscale channels instead of 12 RGB channels). Previous experiments have shown that this change hardly hurts performance.
    • Secondly, you may consider reducing the size of the replay buffer. However, please note that this may slightly decrease the performance of the algorithm, and the specific impact would depend on your environment and algorithm settings.
    • Furthermore, you can use a more memory-efficient storage format, such as converting images to compressed strings before storing them. You can add transform2string=True, gray_scale=True to the policy field of the configuration. Please note that this feature is still under development, and contributions are very welcome.
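
For concreteness, here is the arithmetic from point 1 as a small Python sketch. It assumes uint8 pixels (1 byte each) together with the observation shape and default buffer capacity quoted above; the real footprint also includes Python object and framework overhead, so treat this as a rough lower bound.

```python
# Rough lower bound on replay buffer memory for stacked RGB Atari observations.
channels = 12                 # 4 stacked frames x 3 RGB channels
height = width = 96           # resized Atari frame
bytes_per_obs = channels * height * width   # uint8 pixels -> ~108 KiB per stored observation
buffer_capacity = int(1e6)                  # default replay buffer size

total_bytes = bytes_per_obs * buffer_capacity
print(f"~{total_bytes / 1024**3:.0f} GiB")  # ~103 GiB
```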

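Below is a minimal, hedged sketch of the kind of overrides suggested in point 2, assuming a LightZero-style config with separate env and policy sections. The field names (gray_scale, transform2string, replay_buffer_size) are taken from the suggestions above, and the reduced capacity of 5e5 is only an illustrative value; verify both against the config file you actually run.

```python
# Hedged sketch: memory-related overrides layered on top of a default Atari MuZero config.
from easydict import EasyDict

memory_overrides = EasyDict(dict(
    env=dict(
        gray_scale=True,              # 4 stacked grayscale frames instead of 12 RGB channels
    ),
    policy=dict(
        gray_scale=True,              # keep the model input consistent with the env setting
        transform2string=True,        # store observations as compressed strings (experimental feature)
        replay_buffer_size=int(5e5),  # illustrative: half the default 1e6 -> roughly half the buffer memory
    ),
))
```

Reducing the buffer size trades memory for sample diversity, so, as noted above, the performance impact depends on your environment and algorithm settings.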
If the above methods cannot solve the problem, you might need to consider increasing the memory capacity of your system.

Best wishes.
