
Possible memory leak when inferencing BLOOM 176B #614

Closed
mayank31398 opened this issue Aug 8, 2022 · 20 comments
Labels
bug Something isn't working


@mayank31398

System Info

- `Accelerate` version: 0.11.0
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Script:
https://github.com/mayank31398/Megatron-DeepSpeed/blob/add-generation-server/scripts/inference/bloom-accelerate-server.py

Usage: python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT

Memory blowup over time discussed here: bigscience-workshop/Megatron-DeepSpeed#308 (comment)

Expected behavior

I don't think this memory leak should occur.
@mayank31398 mayank31398 added the bug Something isn't working label Aug 8, 2022
@sgugger
Collaborator

sgugger commented Aug 8, 2022

The model is used on our side on the inference API with the same options and without any memory leak, so I suspect the memory leak comes from somewhere else in the script and not from Accelerate per se. I'm not familiar with Flask, for instance, so I'm not sure whether it properly releases the memory of objects that are no longer needed.

In any case, we'd need a smaller reproducer (ideally with a smaller model) to investigate further. We haven't seen anyone report a memory leak on large models so far (including BLOOM, OPT, GPT-Neo-X or T5/T0pp).

@mayank31398
Author

Hmm, @sgugger can you tell me which library you are using for the inference API, if not Flask?

@sgugger
Collaborator

sgugger commented Aug 9, 2022

We use starlette on our side.

@mayank31398
Author

Thanks, I can confirm that this issue is not occurring with Starlette and FastAPI (which is built on top of Starlette).
Not sure why this happens with Flask.
Closing this ❤️

@mayank31398
Author

@muellerzr @sgugger
Nevermind, this is still happening even with this minimal working example:
As you can see, I am not even storing any variables, only the model and the tokenizer.
This can easily be reproduced by launching Python interactively.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", max_memory={0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}, torch_dtype=torch.bfloat16)
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
>>> model.generate(**tokenizer("Hello, I am a ", return_tensors="pt"), max_new_tokens=100)
tensor([[ 59414,     15,    473,    912,    267,    210,   2084,  48982,    361,
           8608,     17,    473,    912,  10343,    427,   5219,    267,   6353,
           4830,    861,   2152,  11330,    267,   4737,    461,  18409,    530,
           3262,    368,   5579, 108268,    664,    660,  10029,     15,    718,
           2152,  11330,    368,  15321,    461,    861,  10029,     17,    473,
            912,   3936,    267,  13828,  66778,    427,  11330,    368,   4737,
            461,  18409,     17,    473,    912,  11045,    427,  11330,    368,
           4737,    461,  18409,   1965,   3262,    473,  17261,    664,    660,
          10029,     15,    718,    632,   1130,  86953,    368,  15321,    461,
            861,  10029,     17,    473,    912,   1130,  11097,   3595,    473,
            912,  12491,  15879,     17,    473,    912,   3936,    267,  32532,
            427,  11330,    368,   4737,    461,  18409,    530]])

Memory footprint before calling generate:
[Screenshot: Screen Shot 2022-08-23 at 5 11 16 PM]
Memory footprint after calling generate:
[Screenshot: Screen Shot 2022-08-23 at 5 11 26 PM]

@mayank31398 mayank31398 reopened this Aug 23, 2022
@mayank31398
Author

If I call a torch.cuda.empty_cache() after this, then this happens:
[Screenshot: Screen Shot 2022-08-23 at 5 27 13 PM]

@ydshieh
Contributor

ydshieh commented Aug 23, 2022

Hi @mayank31398

could you try to run your code snippet as a script and measure the memory usage

  • after the model is loaded
  • after a single model forward pass
  • after model.generate

You can set breakpoints in the script to do so.
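
For reference, a minimal sketch of such a measurement script (just an illustration, using gpt2 as a stand-in model and PyTorch's own counters rather than nvidia-smi; the prompt and token count are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def report(tag):
    # memory_allocated: memory held by live tensors; memory_reserved: the allocator's cache
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
report("after model load")

inputs = tokenizer("Hello, I am a ", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**inputs)  # a single forward pass
report("after forward")

model.generate(**inputs, max_new_tokens=100)
report("after generate")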

@mayank31398
Author

mayank31398 commented Aug 23, 2022

Could it be that this is expected behaviour?
@ydshieh
I am seeing a memory blowup with gpt2 as well, after replacing bigscience/bloom with gpt2.

I am not sure if this is the correct approach to measuring memory in PyTorch models.
And gpt2 doesn't use Accelerate.

@mayank31398
Author

With pdb, I am seeing a blowup too.
But my guess would be that this is not the right way to measure memory, since I see something similar with GPT2 as well. I think the PyTorch memory allocator allocates some memory a priori for tensors and that shows up in nvidia-smi.

@ydshieh
Contributor

ydshieh commented Aug 23, 2022

PyTorch memory allocator allocates some memory a priori for tensors and that shows up in nvidia-smi.

This is correct. I also had an issue with this previously, see here. In short, PyTorch won't always release GPU memory - it can reuse it later (for faster operations), so it doesn't necessarily mean there is a memory issue.

But after empty_cache, we should see the usage drop, even if only partially. So from your screenshot (interactive shell), it's strange that nothing is released.
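
As an illustration (not part of the original discussion), PyTorch's own counters make the distinction visible: memory_allocated tracks live tensors, while memory_reserved tracks the allocator's cache, which is what empty_cache returns to the driver. A minimal sketch:

import torch

# allocated: memory held by live tensors; reserved: memory cached by the allocator
alloc = torch.cuda.memory_allocated() / 2**20
reserved_before = torch.cuda.memory_reserved() / 2**20

torch.cuda.empty_cache()  # return cached, currently unused blocks to the driver

reserved_after = torch.cuda.memory_reserved() / 2**20
print(f"allocated: {alloc:.0f} MiB")
print(f"reserved: {reserved_before:.0f} MiB -> {reserved_after:.0f} MiB after empty_cache()")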

@muellerzr
Collaborator

muellerzr commented Aug 23, 2022

You may also be able to free a bit more memory by running garbage collection after deleting the model in Python.

E.g.:

import gc
del model
gc.collect()

(Also sorry for accidentally closing, I am on mobile and hit the wrong button!)

@muellerzr muellerzr reopened this Aug 23, 2022
@ydshieh
Contributor

ydshieh commented Aug 23, 2022

@mayank31398, when you say you see a blowup too, what do you mean?
We should measure the difference between the memory usage after model inference and the usage when the model is loaded to CUDA but no input has been run yet.

For GPT2, after loading the model, my GPU uses 1388 MB, and after inference it goes to 1470 MB. I am not surprised by these numbers, however.

@ydshieh
Contributor

ydshieh commented Aug 23, 2022

Hi, I have something for you @mayank31398. Below is an example with t5-large (GPT2 is too small to see the difference).

  • model = model.to("cuda"): 3662 MB
  • After the generation is done, but before returning: 7746 MB
  • After the generation is done, and back in main: 7746 MB
  • After empty cache: 3732 MB

So emptying the cache can bring the GPU memory usage back to (roughly) the point where the model was loaded to the GPU.

So I believe there is no issue when we do things locally. However, when combining with web frameworks, things get more complicated, and the measurement depends on your needs. In any case, I would suggest you check the initial GPU memory usage (after the model is loaded to the GPU) and monitor the usage as inferences are performed.

Let us know if you have further questions.

Here is the code
(The FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY are copied from https://github.com/huggingface/transformers/blob/0f257a87749e0a72bda260c6f319a45dae1e7c4d/tests/models/t5/test_modeling_t5.py#L924)

import pdb

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


def run(model, dct):

    hypotheses_batch = model.generate(
        **dct,
        num_beams=4,
        length_penalty=2.0,
        max_length=142,
        min_length=56,
        no_repeat_ngram_size=3,
        do_sample=False,
        early_stopping=True,
    )

    print("gen. done")
    pdb.set_trace()


if __name__ == "__main__":

    ckpt = "t5-large"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)

    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    model.config.update(model.config.task_specific_params["summarization"])

    dct = tokenizer(
        [model.config.prefix + x for x in [FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY]],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print("input / model on cpu")
    pdb.set_trace()

    dct = dct.to("cuda")
    print("input to gpu")
    pdb.set_trace()

    model = model.to("cuda")
    print("model to gpu")
    pdb.set_trace()

    run(model, dct)
    print("model run done")
    pdb.set_trace()

    torch.cuda.empty_cache()
    print("clear done")
    pdb.set_trace()

@mayank31398
Author

You may also be able to free a bit more memory by running garbage collection after deleting the model in Python.

E.g.:

import gc
del model
gc.collect()

(Also sorry for accidentally closing, I am on mobile and hit the wrong button!)

Deleting the model is not an option for me.
I am trying to use the model in a server setting for a lot of folks.
Related PR: bigscience-workshop/Megatron-DeepSpeed#328

@mayank31398
Author

Yes, thanks. I think I'll try to watch the memory usage over time by running generation in a for loop or something, to see how the memory changes (both in server and non-server settings).
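
A rough sketch of such a loop (assuming the model and tokenizer from the reproduction above are already loaded; with device_map="auto", Accelerate's hooks move the inputs to the right device. A steadily growing allocated number across iterations would point to a real leak, while a cached/reserved number that plateaus after the first few iterations would not):

import torch

for i in range(50):
    # same generate call as in the reproduction above
    out = model.generate(**tokenizer("Hello, I am a ", return_tensors="pt"), max_new_tokens=100)
    del out  # drop the reference so the output tensors can be freed
    per_gpu = [torch.cuda.memory_allocated(d) // 2**20 for d in range(torch.cuda.device_count())]
    print(f"iter {i}: allocated per GPU (MiB): {per_gpu}")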

@stas00
Contributor

stas00 commented Aug 23, 2022

what @ydshieh said and more:

to track real memory usage / debug potential leaks always:

  1. call gc.collect() first - since Python's GC is scheduled, and without it you might miss an object and its associated memory release
  2. then clear the cuda cache
  3. measure

but don't do any of the above for production.
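
A minimal sketch of that debug-only sequence (the function name here is just for illustration):

import gc
import torch

def measure_cuda_memory(device=0):
    # debug only - do not call this in a production serving loop
    gc.collect()                    # 1. run Python's GC so unreferenced tensors are actually freed
    torch.cuda.empty_cache()        # 2. return the allocator's cached blocks to the driver
    torch.cuda.synchronize(device)  #    wait for pending kernels before reading the counter
    return torch.cuda.memory_allocated(device)  # 3. measure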

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@mayank31398
Author

This is not an issue anymore. Thanks for helping, guys.
Closing this :)

@ydshieh
Contributor

ydshieh commented Oct 20, 2022

@mayank31398 Would you like to share what works for you :-) 🙏

@mayank31398
Author

I converted my server to Flask and ran it with gunicorn with 1 worker.
This serializes all requests, however.
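
For reference, a typical invocation of such a setup might look like the following (the bloom_flask_server:app target is a placeholder, not the actual module name from the linked script):

gunicorn --workers 1 --bind $ADDRESS:$PORT bloom_flask_server:app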
