Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent GBM memory leaks by forcing actor cleanup after each epoch #3323

Merged
merged 11 commits into from
Apr 6, 2023

Conversation

arnavgarg1
Copy link
Contributor

@arnavgarg1 arnavgarg1 commented Apr 5, 2023

Ray Actors get re-initialized after each epoch/boosting rounds, but the actors leave a lot of garbage hanging around in Ray's object store. This results in a few problems:

  1. Increased in-memory data usage epoch after epoch since artifacts/objects aren't cleaned up, potentially leading to OOMs
  2. Increased disk-spillage from object store to disk once the 20% threshold has passed, increasing time per epoch by an order of magnitude after each epoch

Together, these result in models either OOMing, or taking forever to train.

This PR forces cleanup after each epoch to prevent memory leaks by wrapping the LightGBM train step when using Ray with a ray.remote(max_calls=1) to kill the Ray Actors and GC at the end of every epoch. This involves two changes:

  1. The creation of a modified train_loop function from the base class that enables making ray.remote calls.
  2. Moving the train_step function outside of the LightGBMRayTrainer class and making it its own ray remote function that is called by the train_loop function inside the LightGBMRayTrainer class

@arnavgarg1 arnavgarg1 changed the title Force actor cleanup to prevent memory leaks for GBMs Prevent GBM memory leaks by forcing actor cleanup after each epoch Apr 5, 2023
ludwig/trainers/trainer_lightgbm.py Outdated Show resolved Hide resolved
ludwig/trainers/trainer_lightgbm.py Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Apr 6, 2023

Unit Test Results

    4 files  ±    0      4 suites  ±0   1h 52m 33s ⏱️ + 1h 11m 17s
185 tests +152  156 ✔️ +126  29 💤 +26  0 ±0 
197 runs  +131  166 ✔️ +106  31 💤 +25  0 ±0 

Results for commit e550a58. ± Comparison against base commit cf5c604.

Copy link
Collaborator

@justinxzhao justinxzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix!

ludwig/trainers/trainer_lightgbm.py Show resolved Hide resolved
@arnavgarg1 arnavgarg1 merged commit 6979161 into master Apr 6, 2023
7 of 8 checks passed
@arnavgarg1 arnavgarg1 deleted the gbm_mem_leaks branch April 6, 2023 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants