
Training process gets killed due to OOM #82

Closed
aceofgreens opened this issue Aug 24, 2023 · 1 comment
Labels: config (New or improved configuration), good first issue (Good for newcomers)

Comments


aceofgreens commented Aug 24, 2023

Summary of issue

The training process gets killed by the kernel; dmesg shows the reason as "out of memory".

Model: MuZero with self-supervision
Environment: Pong
The architecture is exactly the same as the default one for Atari envs, except that:

  • I am using RGB instead of grayscale (so input to the model is (B, 12, 96, 96) with 4 stacked frames)
  • I am using a few additional layers in the representation network

The process gets killed after 40k training iterations (a bit more than 500k environment steps). The Buffer/memory_usage/process log shows that the memory used by the process starts from 0 and grows slightly faster than linearly up to about 6e+4, at which point the process is killed.

NOTE: I have been able to reproduce the "Quick Start" training run on Pong with the default config. No issue there.

General questions:

  1. Why does the memory used by the process seem to always increase? Is it the replay buffer?
  2. Is there a way to control the memory used from any of the config settings, so that the process does not get killed?
@puyuan1996 added the good first issue (Good for newcomers) and config (New or improved configuration) labels on Aug 28, 2023
@puyuan1996 (Collaborator) commented

Hello,

Thank you for your attention and inquiry.

  1. The steady increase in process memory is primarily due to the continuous addition of collected data into the replay buffer, so the buffer's memory footprint keeps growing. A rough estimate (see the sketch after this list): 12 * 96 * 96 bytes per stored observation * 1e6 / 1024^3 ≈ 103 GB, where 1e6 is the default capacity of the replay buffer.

  2. You can indeed control memory usage through the configuration settings (a sketch of the relevant fields follows below).

    • Firstly, we recommend using grayscale images, which cuts observation memory roughly by a factor of 3 (4 stacked grayscale channels instead of 12 RGB channels). Previous experiments have shown that this change hardly hurts performance.
    • Secondly, you may consider reducing the size of the replay buffer. However, please note that this may slightly decrease the performance of the algorithm, and the specific impact would depend on your environment and algorithm settings.
    • Furthermore, you can use a more memory-efficient storage format, such as converting images to compressed strings before storing them. You can add transform2string=True, gray_scale=True to the policy field of the configuration. Please note that this feature is still under development, and contributions are very welcome.
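
For concreteness, here is the arithmetic from point 1 as a small Python sketch. It assumes uint8 pixels (1 byte each) together with the observation shape and default buffer capacity quoted above; the real footprint also includes Python object and framework overhead, so treat this as a rough lower bound.

```python
# Rough lower bound on replay buffer memory for stacked RGB Atari observations.
channels = 12                 # 4 stacked frames x 3 RGB channels
height = width = 96           # resized Atari frame
bytes_per_obs = channels * height * width   # uint8 pixels -> ~108 KiB per stored observation
buffer_capacity = int(1e6)                  # default replay buffer size

total_bytes = bytes_per_obs * buffer_capacity
print(f"~{total_bytes / 1024**3:.0f} GiB")  # ~103 GiB
```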

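Below is a minimal, hedged sketch of the kind of overrides suggested in point 2, assuming a LightZero-style config with separate env and policy sections. The field names (gray_scale, transform2string, replay_buffer_size) are taken from the suggestions above, and the reduced capacity of 5e5 is only an illustrative value; verify both against the config file you actually run.

```python
# Hedged sketch: memory-related overrides layered on top of a default Atari MuZero config.
from easydict import EasyDict

memory_overrides = EasyDict(dict(
    env=dict(
        gray_scale=True,              # 4 stacked grayscale frames instead of 12 RGB channels
    ),
    policy=dict(
        gray_scale=True,              # keep the model input consistent with the env setting
        transform2string=True,        # store observations as compressed strings (experimental feature)
        replay_buffer_size=int(5e5),  # illustrative: half the default 1e6 -> roughly half the buffer memory
    ),
))
```

Reducing the buffer size trades memory for sample diversity, so, as noted above, the performance impact depends on your environment and algorithm settings.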
If the above methods cannot solve the problem, you might need to consider increasing the memory capacity of your system.

Best wishes.
