
Removing memory/deleting a model: how to properly do this #6753

Closed
2 of 4 tasks
yakazimir opened this issue Aug 26, 2020 · 5 comments

Comments

@yakazimir

yakazimir commented Aug 26, 2020

Environment info

  • transformers version: 2.11.0
  • Platform:
  • Python version: 3.6.7
  • PyTorch version (GPU?): 1.4.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help

@patrickvonplaten

Information

Model I am using (Bert, XLNet ...): T5-large, T5-3b, bert-base-uncased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load a model
  2. Try to remove it via del, then clear the GPU memory and cache
import gc

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("t5-large")  # same behavior for `bert-base-uncased`, larger T5 models, ...
model = model.cuda()
model = model.train()

## delete the model and release the cached GPU memory
del model
torch.cuda.empty_cache()  # public API for torch._C._cuda_emptyCache()
## alternatively, empty the cache for a specific device
# with torch.cuda.device("cuda:0"):
#     torch.cuda.empty_cache()

## list all tensors that are still alive (as per the discussion here:
## https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741/3)
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
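
A quick way to confirm whether the deletion actually freed anything is to compare the allocator counters before and after; a small sketch using standard PyTorch APIs (the setup is the same as above):

import gc

import torch

print(torch.cuda.memory_allocated())  # bytes currently held by live tensors
# ... del model, as above ...
gc.collect()                          # make sure the Python references are gone
torch.cuda.empty_cache()              # release cached blocks back to the driver
print(torch.cuda.memory_allocated())  # should drop back toward zero if nothing
                                      # else still references the parameters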

Expected behavior

I would expect this to clear the GPU memory, but the tensors still seem to linger. Fuller context: in a larger PyTorch-Lightning script, I'm simply trying to re-load the best model after training (and exiting the pl.Trainer) to run a final evaluation. The behavior seems the same as in this simple example, and I ultimately run out of memory when loading the best model because it is the absolutely massive T5-3b.

@patrickvonplaten
Contributor

I encountered similar problems with freeing GPU memory while implementing the benchmark tools. A trick that worked for me was to wrap the function in a separate process. Maybe you can take a look at this implementation and change your code accordingly so that the model is run in a subprocess:

def separate_process_wrapper_fn(func: Callable[[], None], do_multi_processing: bool) -> Callable[[], None]:
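
For illustration, here is a minimal sketch of the same idea (this is not the linked transformers implementation; the helper function, the queue, and the t5-small example are all just placeholders): load and run the model inside a spawned child process so that its GPU memory is released when the child exits.

import torch
import torch.multiprocessing as mp
from transformers import AutoTokenizer, AutoModelWithLMHead


def _generate_in_subprocess(queue, model_name, text):
    # Everything allocated here (weights, activations, the CUDA context
    # itself) exists only inside this child process.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelWithLMHead.from_pretrained(model_name).cuda().eval()
    input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
    with torch.no_grad():
        output_ids = model.generate(input_ids)
    # Ship the result back on CPU; when this process exits, all of its
    # GPU memory goes with it.
    queue.put(output_ids.cpu())


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # CUDA requires the 'spawn' start method
    queue = ctx.Queue()
    proc = ctx.Process(
        target=_generate_in_subprocess,
        args=(queue, "t5-small", "translate English to German: How are you?"),
    )
    proc.start()
    result = queue.get()  # read the result before joining the child
    proc.join()
    print(result)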

@yakazimir
Author

yakazimir commented Sep 1, 2020

Thanks for getting back!

After investigating a bit further, my particular problems seem to be partly related to PyTorch-Lightning (specifically, to some eval code not properly detaching tensors), but this general piece of advice is good, since this seems to be a more general problem that I've seen in other contexts (as you mentioned). I will look more closely at running a separate process.

As a terrible hack (which probably shouldn't be repeated), I found that converting all models/tensors/training params/... to CPU, then deleting them and applying manual garbage collection, fixed my issue.
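
For what it's worth, a minimal sketch of that hack (the model and optimizer here are just placeholders, and this is not a recommended general solution):

import gc

import torch
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("bert-base-uncased").cuda()
optimizer = torch.optim.Adam(model.parameters())

# ... training / evaluation would happen here ...

model.cpu()               # nn.Module.cpu() moves the parameters off the GPU in place
del model, optimizer      # drop every remaining reference to the CUDA tensors
gc.collect()              # collect the now-unreferenced tensors
torch.cuda.empty_cache()  # hand the cached blocks back to the CUDA driver

print(torch.cuda.memory_allocated())  # should now be (close to) zero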

@jeanmonet

jeanmonet commented Apr 21, 2021

I encountered similar problems with freeing GPU memory while implementing the benchmark tools. A trick that worked for me was to wrap the function in a separate process. Maybe you can take a look at this implementation and change your code accordingly so that the model is run in a subprocess:

def separate_process_wrapper_fn(func: Callable[[], None], do_multi_processing: bool) -> Callable[[], None]:

@patrickvonplaten have you run into the following error using this method?

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I tried setting the start method as follows, with no success:

import multiprocessing as mp
mp.set_start_method('spawn')
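
In case it helps: that error usually appears when CUDA has already been initialized in the parent process before the child is created, or when set_start_method() is called too late (it has to run once, under if __name__ == "__main__":, before any CUDA work and before any process is started). A sketch that sidesteps the global setting by using an explicit spawn context (the worker function and device id are just examples):

import torch.multiprocessing as mp


def worker(device_id):
    # CUDA is touched only here, inside the spawned child.
    import torch
    torch.cuda.set_device(device_id)
    print(torch.cuda.get_device_name(device_id))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # per-context start method, no global call needed
    p = ctx.Process(target=worker, args=(0,))
    p.start()
    p.join()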

@iammeizu

I met the same problem. Any updates?

@junjingfn

Very useful!! Thank you so much for sharing your solution!
