Removing memory/deleting a model: how to properly do this #6753
I encountered similar problems of freeing GPU memory while implementing the benchmark tools. A trick that worked for me was to wrap the function into a multi-process. Maybe you can take a look at this implementation and change your code accordingly so that the model is run in a subprocess:
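The subprocess trick can be sketched as follows. The worker function, queue, and placeholder "inference" are illustrative stand-ins (the real script would load the model inside the worker), but the process lifecycle is the point: once the child exits, the CUDA driver reclaims everything the child allocated.

```python
import multiprocessing as mp

def evaluate(queue, prompt):
    # In the real script, load and run the model HERE, e.g.
    #   model = T5ForConditionalGeneration.from_pretrained("t5-3b").cuda()
    # Everything CUDA-related stays inside this function, so its GPU memory
    # lives and dies with the child process.
    result = prompt.upper()   # placeholder for actual inference
    queue.put(result)         # send only CPU-side data back to the parent

if __name__ == "__main__":
    # 'spawn' gives the child a fresh interpreter, avoiding the CUDA
    # re-initialization errors that 'fork' can hit once the parent process
    # has already touched the GPU.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=evaluate, args=(queue, "hello"))
    worker.start()
    output = queue.get()      # -> "HELLO"
    worker.join()             # child exit releases all of its GPU memory
```

Keeping the model entirely inside the worker matters: if the parent ever holds a CUDA tensor, that allocation outlives the child.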
Thanks for getting back! After investigating a bit further, my particular problems seem to be partly related to PyTorch Lightning (specifically, some of the eval code not properly detaching tensors), but this general advice is good, since this seems to be a broader problem that I've seen in other contexts (like you mentioned). I will look more closely at running a multi-process. As a terrible hack (which probably shouldn't be repeated), I found that converting all models/tensors/training params/.. to CPU, deleting them, and then running manual garbage collection fixed my issue.
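The hack above boils down to three steps: move everything off the GPU, drop the Python references, and force a collection. A framework-free sketch of the reference-dropping part, with the torch-specific calls shown as comments since they only apply on a CUDA machine (`objects` is a hypothetical container for whatever still holds GPU tensors):

```python
import gc

def release_all(objects):
    """Drop every reference in `objects` and force garbage collection.

    `objects` is a hypothetical dict of whatever still holds GPU tensors
    (model, optimizer, cached batches, ...). In the PyTorch case each entry
    would first be moved off the GPU and the CUDA cache released afterwards;
    those calls are left as comments so this sketch stays framework-free.
    """
    for name in list(objects):
        # objects[name] = objects[name].cpu()  # step 1: move off the GPU
        del objects[name]                      # step 2: drop the reference
    gc.collect()                               # step 3: collect cycles
    # torch.cuda.empty_cache()                 # step 4: release cached blocks
```

The `gc.collect()` call matters because reference cycles (common in training loops that stash losses or callbacks) can keep tensors alive long after the obvious names are deleted.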
@patrickvonplaten have you run into the following error using this method?
Tried setting the start method as follows, with no success:

```python
import multiprocessing as mp
mp.set_start_method('spawn')
```
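One common reason `set_start_method('spawn')` fails (or raises `RuntimeError: context has already been set`) is that the start method was already fixed, either by an earlier call or by an imported library. This isn't confirmed as the error above, which isn't shown, but two standard workarounds are:

```python
import multiprocessing as mp

# Option 1: a local context. This leaves the global start method untouched,
# so it cannot conflict with other libraries in the same process.
ctx = mp.get_context("spawn")
# use ctx.Process / ctx.Queue instead of mp.Process / mp.Queue

# Option 2: force the global method. Must run in the main module, before
# any Process or Pool has been created:
# mp.set_start_method("spawn", force=True)
```

Option 1 is generally safer in larger codebases, since any import that touches `multiprocessing` can have set the global method already.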
I've met the same problem. Any updates?
Very useful! Thank you so much for sharing your solution!
Environment info

transformers version: 2.11.0

Who can help

@patrickvonplaten

Information
Model I am using (Bert, XLNet ...): T5-large, T5-3b, bert-base-uncased
The problem arises when using:
The tasks I am working on are:
To reproduce

Steps to reproduce the behavior: `del` the model, then clear GPU memory and cache.

Expected behavior
I would expect this to clear the GPU memory, but the tensors still seem to linger. Fuller context: in a larger PyTorch Lightning script, I'm simply trying to re-load the best model after training (and after exiting the pl.Trainer) to run a final evaluation. The behavior seems the same as in this simple example; ultimately I run out of memory when loading the best model, because the model is the absolutely massive T5-3b.
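The reproduction steps can be sketched as a runnable snippet. A tiny placeholder class stands in for the actual T5-3b checkpoint, and torch is imported lazily so the sketch also runs where no GPU stack is installed; the sequence of calls is what the report describes:

```python
import gc

class PlaceholderModel:       # hypothetical stand-in for the loaded T5-3b
    pass

model = PlaceholderModel()    # in the report, this is the model on the GPU
del model                     # drop the only Python reference
gc.collect()                  # collect cycles that may still pin tensors

try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()              # return cached blocks to the driver
        print(torch.cuda.memory_allocated())  # lingering tensors show up here
except ImportError:
    pass                      # torch not installed here; the steps are the same
```

Note that `torch.cuda.memory_allocated()` counts live tensors while `torch.cuda.memory_reserved()` counts PyTorch's cache; comparing the two helps distinguish genuinely leaked tensors from memory the allocator is merely holding on to.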