
Possible memory leak when inferencing BLOOM 176B #614

Closed
mayank31398 opened this issue Aug 8, 2022 · 20 comments
Labels
bug Something isn't working


@mayank31398

System Info

- `Accelerate` version: 0.11.0
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Script:
https://github.com/mayank31398/Megatron-DeepSpeed/blob/add-generation-server/scripts/inference/bloom-accelerate-server.py

Usage: python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT

Memory blowup over time discussed here: bigscience-workshop/Megatron-DeepSpeed#308 (comment)

Expected behavior

I don't think this memory leak should occur.
@mayank31398 mayank31398 added the bug Something isn't working label Aug 8, 2022
@sgugger
Collaborator

sgugger commented Aug 8, 2022

The model is used on our side on the inference API with the same options and without any memory leak, so I suspect the memory leak comes from somewhere else in the script and not from Accelerate per se. I'm not familiar with Flask, for instance, so I'm not sure whether it properly releases the memory of objects that are no longer needed.

In any case, we'd need a smaller reproducer (ideally with a smaller model) to investigate further. We haven't seen anyone report a memory leak on large models so far (including BLOOM, OPT, GPT-Neo-X or T5/T0pp).

@mayank31398
Author

Hmm, @sgugger can you tell me which library you are using for the inference API, if not Flask?

@sgugger
Collaborator

sgugger commented Aug 9, 2022

We use starlette on our side.

@mayank31398
Author

Thanks, I can confirm that this issue is not occurring with Starlette and FastAPI (which is built on top of Starlette).
Not sure why this happens with Flask.
Closing this ❤️

@mayank31398
Author

@muellerzr @sgugger
Nevermind, this is still happening even with this minimal working example:
As you can see, I am not even storing any variables, only the model and the tokenizer.
This can easily be reproduced by launching Python interactively.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", max_memory={0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}, torch_dtype=torch.bfloat16)
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
>>> model.generate(**tokenizer("Hello, I am a ", return_tensors="pt"), max_new_tokens=100)
tensor([[ 59414,     15,    473,    912,    267,    210,   2084,  48982,    361,
           8608,     17,    473,    912,  10343,    427,   5219,    267,   6353,
           4830,    861,   2152,  11330,    267,   4737,    461,  18409,    530,
           3262,    368,   5579, 108268,    664,    660,  10029,     15,    718,
           2152,  11330,    368,  15321,    461,    861,  10029,     17,    473,
            912,   3936,    267,  13828,  66778,    427,  11330,    368,   4737,
            461,  18409,     17,    473,    912,  11045,    427,  11330,    368,
           4737,    461,  18409,   1965,   3262,    473,  17261,    664,    660,
          10029,     15,    718,    632,   1130,  86953,    368,  15321,    461,
            861,  10029,     17,    473,    912,   1130,  11097,   3595,    473,
            912,  12491,  15879,     17,    473,    912,   3936,    267,  32532,
            427,  11330,    368,   4737,    461,  18409,    530]])

Memory footprint before calling generate:
[Screenshot: Screen Shot 2022-08-23 at 5 11 16 PM]
Memory footprint after calling generate:
[Screenshot: Screen Shot 2022-08-23 at 5 11 26 PM]

@mayank31398 mayank31398 reopened this Aug 23, 2022
@mayank31398
Author

If I call a torch.cuda.empty_cache() after this, then this happens:
[Screenshot: Screen Shot 2022-08-23 at 5 27 13 PM]

@ydshieh
Contributor

ydshieh commented Aug 23, 2022

Hi @mayank31398

could you try to run your code snippet as a script and measure the memory usage

  • after the model is loaded
  • after a single model forward pass
  • after model.generate

You can set breakpoints in the script to do so.
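
For reference, a minimal sketch of such a measurement script (just an illustration, using gpt2 as a stand-in model and PyTorch's own counters rather than nvidia-smi; the prompt and token count are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def report(tag):
    # memory_allocated: memory held by live tensors; memory_reserved: the allocator's cache
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
report("after model load")

inputs = tokenizer("Hello, I am a ", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**inputs)  # a single forward pass
report("after forward")

model.generate(**inputs, max_new_tokens=100)
report("after generate")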

@mayank31398
Author

mayank31398 commented Aug 23, 2022

Could it be that this is expected behaviour?
@ydshieh
I am seeing a memory blowup with gpt2 as well, after replacing bigscience/bloom with gpt2.

I am not sure if this is the correct approach to measuring memory in PyTorch models.
And gpt2 doesn't use Accelerate.

@mayank31398
Author

With pdb, I am seeing a blowup too.
But my guess would be that this is not the right way to measure memory, since I see something similar with GPT2 as well. I think the PyTorch memory allocator allocates some memory a priori for tensors and that shows up in nvidia-smi.

@ydshieh
Contributor

ydshieh commented Aug 23, 2022

PyTorch memory allocator allocates some memory a priori for tensors and that shows up in nvidia-smi.

This is correct. I also had an issue with this previously, see here. In short, PyTorch won't always release GPU memory - it can reuse it later (for faster operations), so it doesn't necessarily mean there is a memory issue.

But after empty_cache, we should see the usage drop, even if only partially. So from your screenshot (interactive shell), it's strange that nothing is released.
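
As an illustration (not part of the original discussion), PyTorch's own counters make the distinction visible: memory_allocated tracks live tensors, while memory_reserved tracks the allocator's cache, which is what empty_cache returns to the driver. A minimal sketch:

import torch

# allocated: memory held by live tensors; reserved: memory cached by the allocator
alloc = torch.cuda.memory_allocated() / 2**20
reserved_before = torch.cuda.memory_reserved() / 2**20

torch.cuda.empty_cache()  # return cached, currently unused blocks to the driver

reserved_after = torch.cuda.memory_reserved() / 2**20
print(f"allocated: {alloc:.0f} MiB")
print(f"reserved: {reserved_before:.0f} MiB -> {reserved_after:.0f} MiB after empty_cache()")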

@muellerzr
Collaborator

muellerzr commented Aug 23, 2022

You may also be able to free a bit more memory by running garbage collection after deleting the model in Python.

E.g.:

import gc
del model
gc.collect()

(Also sorry for accidentally closing, I am on mobile and hit the wrong button!)

@muellerzr muellerzr reopened this Aug 23, 2022
@ydshieh
Contributor

ydshieh commented Aug 23, 2022

@mayank31398, when you say you see a blowup too, what do you mean?
We should measure the difference between the memory usage after model inference and the usage when the model is loaded to CUDA but no input has been run yet.

For GPT2, after loading the model, my GPU uses 1388 MB, and after inference it goes to 1470 MB. I am not surprised by these numbers, however.

@ydshieh
Contributor

ydshieh commented Aug 23, 2022

Hi, I have something for you @mayank31398. Below is an example with t5-large (GPT2 is too small to see the difference).

  • model = model.to("cuda"): 3662 MB
  • After the generation is done, but before returning: 7746 MB
  • After the generation is done, and back in main: 7746 MB
  • After empty cache: 3732 MB

So emptying the cache can bring the GPU memory usage back to (roughly) the point where the model was loaded to the GPU.

So I believe there is no issue when we do things locally. However, when combining with web frameworks, things get more complicated, and the measurement depends on your needs. In any case, I would suggest you check the initial GPU memory usage (after the model is loaded to the GPU) and monitor the usage as inferences are performed.

Let us know if you have further questions.

Here is the code
(The FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY are copied from https://github.com/huggingface/transformers/blob/0f257a87749e0a72bda260c6f319a45dae1e7c4d/tests/models/t5/test_modeling_t5.py#L924)

import pdb

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


def run(model, dct):

    hypotheses_batch = model.generate(
        **dct,
        num_beams=4,
        length_penalty=2.0,
        max_length=142,
        min_length=56,
        no_repeat_ngram_size=3,
        do_sample=False,
        early_stopping=True,
    )

    print("gen. done")
    pdb.set_trace()


if __name__ == "__main__":

    ckpt = "t5-large"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)

    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    model.config.update(model.config.task_specific_params["summarization"])

    dct = tokenizer(
        [model.config.prefix + x for x in [FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY]],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print("input / model on cpu")
    pdb.set_trace()

    dct = dct.to("cuda")
    print("input to gpu")
    pdb.set_trace()

    model = model.to("cuda")
    print("model to gpu")
    pdb.set_trace()

    run(model, dct)
    print("model run done")
    pdb.set_trace()

    torch.cuda.empty_cache()
    print("clear done")
    pdb.set_trace()

@mayank31398
Author

You may also be able to free a bit more memory by running garbage collection after deleting the model in Python.

E.g.:

import gc
del model
gc.collect()

(Also sorry for accidentally closing, I am on mobile and hit the wrong button!)

Deleting the model is not an option for me.
I am trying to use the model in a server setting for a lot of folks.
Related PR: bigscience-workshop/Megatron-DeepSpeed#328

@mayank31398
Author

Yes, thanks. I think I'll try to watch the memory usage over time by running generation in a for loop or something, to see how the memory changes (both in server and non-server settings).
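
A rough sketch of such a loop (assuming the model and tokenizer from the reproduction above are already loaded; with device_map="auto", Accelerate's hooks move the inputs to the right device. A steadily growing allocated number across iterations would point to a real leak, while a cached/reserved number that plateaus after the first few iterations would not):

import torch

for i in range(50):
    # same generate call as in the reproduction above
    out = model.generate(**tokenizer("Hello, I am a ", return_tensors="pt"), max_new_tokens=100)
    del out  # drop the reference so the output tensors can be freed
    per_gpu = [torch.cuda.memory_allocated(d) // 2**20 for d in range(torch.cuda.device_count())]
    print(f"iter {i}: allocated per GPU (MiB): {per_gpu}")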

@stas00
Contributor

stas00 commented Aug 23, 2022

what @ydshieh said and more:

to track real memory usage / debug potential leaks always:

  1. call gc.collect() first - since Python's GC is scheduled, and without it you might miss an object and its associated memory release
  2. then clear the cuda cache
  3. measure

but don't do any of the above for production.
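
A minimal sketch of that debug-only sequence (the function name here is just for illustration):

import gc
import torch

def measure_cuda_memory(device=0):
    # debug only - do not call this in a production serving loop
    gc.collect()                    # 1. run Python's GC so unreferenced tensors are actually freed
    torch.cuda.empty_cache()        # 2. return the allocator's cached blocks to the driver
    torch.cuda.synchronize(device)  #    wait for pending kernels before reading the counter
    return torch.cuda.memory_allocated(device)  # 3. measure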

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@mayank31398
Author

This is not an issue anymore. Thanks for helping, guys.
Closing this :)

@ydshieh
Contributor

ydshieh commented Oct 20, 2022

@mayank31398 Would you like to share what works for you :-) 🙏

@mayank31398
Author

I converted my server to Flask and ran it with gunicorn with 1 worker.
This serializes all requests, however.
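
For reference, a typical invocation of such a setup might look like the following (the bloom_flask_server:app target is a placeholder, not the actual module name from the linked script):

gunicorn --workers 1 --bind $ADDRESS:$PORT bloom_flask_server:app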
