Possible memory leak when inferencing BLOOM 176B #614
The model is used on our side in the inference API with the same options and without any memory leak, so I suspect the memory leak comes from somewhere else in the script and not from Accelerate per se. I'm unfamiliar with Flask, for instance, so I'm not sure whether it properly releases the memory of objects that are no longer needed. In any case, we'd need a smaller reproducer (ideally with a smaller model) to investigate further. We haven't seen anyone report a memory leak on large models so far (including BLOOM, OPT, GPT-NeoX, or T5/T0pp).
Hmm, @sgugger, can you tell us which library you are using for the inference API, if not Flask?
We use Starlette.
Thanks, I can confirm that this issue is not occurring with Starlette or FastAPI (which is built on top of Starlette).
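For anyone comparing, here is a minimal sketch of a FastAPI-style generation endpoint. The model, route, and request schema are illustrative placeholders, not the actual inference API or the script from this issue:

```python
# A minimal FastAPI generation endpoint -- a sketch only; the model, route,
# and request schema here are illustrative placeholders.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

app = FastAPI()

class GenerateRequest(BaseModel):
    text: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.text, return_tensors="pt").to(device)
    with torch.no_grad():  # keep autograd state from accumulating across requests
        output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"generated_text": tokenizer.decode(output[0], skip_special_tokens=True)}
```

Run with something like `uvicorn server:app --workers 1` and watch memory across repeated requests.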
@muellerzr @sgugger

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", max_memory={0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}, torch_dtype=torch.bfloat16)
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
>>> model.generate(**tokenizer("Hello, I am a ", return_tensors="pt"), max_new_tokens=100)
tensor([[ 59414,     15,    473,    912,    267,    210,   2084,  48982,    361,
           8608,     17,    473,    912,  10343,    427,   5219,    267,   6353,
           4830,    861,   2152,  11330,    267,   4737,    461,  18409,    530,
           3262,    368,   5579, 108268,    664,    660,  10029,     15,    718,
           2152,  11330,    368,  15321,    461,    861,  10029,     17,    473,
            912,   3936,    267,  13828,  66778,    427,  11330,    368,   4737,
            461,  18409,     17,    473,    912,  11045,    427,  11330,    368,
           4737,    461,  18409,   1965,   3262,    473,  17261,    664,    660,
          10029,     15,    718,    632,   1130,  86953,    368,  15321,    461,
            861,  10029,     17,    473,    912,   1130,  11097,   3595,    473,
            912,  12491,  15879,     17,    473,    912,   3936,    267,  32532,
            427,  11330,    368,   4737,    461,  18409,    530]])
```

Memory footprint before calling generate: […]
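One way to take such a footprint measurement is sketched below; the per-device loop assumes a multi-GPU setup like the one above, and the helper name is made up:

```python
import torch

def report_gpu_memory(tag: str) -> None:
    # memory_allocated: bytes held by live tensors; memory_reserved: bytes held
    # by PyTorch's caching allocator (closer to what nvidia-smi reports).
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"{tag}: cuda:{i} allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

report_gpu_memory("before generate")
# output = model.generate(**inputs, max_new_tokens=100)
report_gpu_memory("after generate")
```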
Hi @mayank31398, could you try to run your code snippet as a script and measure the memory usage?
You can set breakpoints in the script to do so.
Could it be that this is expected behaviour? I am not sure this is the correct approach to measuring memory in PyTorch models.
With pdb, I am seeing a blowup too.
This is correct. I also had an issue previously, see here. In short, PyTorch won't always release GPU memory: it can re-use it later (which makes subsequent allocations faster), so this by itself doesn't mean there is a memory issue. But after […]
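Here is a small sketch of the behaviour being described, assuming a CUDA device is available: memory freed by Python stays in PyTorch's caching allocator (so `memory_reserved` and nvidia-smi stay high) until `torch.cuda.empty_cache()` is called:

```python
import torch

def show(label):
    print(f"{label}: allocated={torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**30:.2f} GiB")

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of fp32
show("after alloc")          # allocated and reserved are both ~1 GiB

del x                        # tensor is gone: allocated drops to ~0,
show("after del")            # but reserved stays -- the allocator caches the block

torch.cuda.empty_cache()     # hand cached blocks back to the driver;
show("after empty_cache")    # now nvidia-smi would drop too
```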
You may also be able to get back a bit more by running garbage collection after deleting the model in Python, e.g.:

```python
import gc

del model
gc.collect()
```

(Also, sorry for accidentally closing; I'm on mobile and hit the wrong button!)
@mayank31398, when you say […]. For GPT-2, after loading the model my GPU uses 1388 MB, and after inference it goes to 1470 MB. These numbers wouldn't surprise me, however.
Hi, I have something for you @mayank31398. Below is an example with `t5-large`.
So emptying the cache can bring the GPU memory usage back to the point where the model was loaded onto the GPU. I therefore believe there is no issue when we do things locally. However, when combining with web frameworks things get more complicated, and the measurement depends on your needs. In any case, I would suggest you check the initial GPU memory usage (after the model is loaded onto the GPU) and monitor its usage once inferences are performed. Let us know if you have further questions. Here is the code:

```python
import pdb

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE and ARTICLE_SUBWAY are long
# article strings (omitted here); define them before running.

def run(model, dct):
    hypotheses_batch = model.generate(
        **dct,
        num_beams=4,
        length_penalty=2.0,
        max_length=142,
        min_length=56,
        no_repeat_ngram_size=3,
        do_sample=False,
        early_stopping=True,
    )
    print("gen. done")
    pdb.set_trace()

if __name__ == "__main__":
    ckpt = "t5-large"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    model.config.update(model.config.task_specific_params["summarization"])

    dct = tokenizer(
        [model.config.prefix + x for x in [FRANCE_ARTICLE, SHORTER_ARTICLE, IRAN_ARTICLE, ARTICLE_SUBWAY]],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print("input / model on cpu")
    pdb.set_trace()

    dct = dct.to("cuda")
    print("input to gpu")
    pdb.set_trace()

    model = model.to("cuda")
    print("model to gpu")
    pdb.set_trace()

    run(model, dct)
    print("model run done")
    pdb.set_trace()

    torch.cuda.empty_cache()
    print("clear done")
    pdb.set_trace()
```
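A small helper like the sketch below could be defined at the top of the script above, so that the allocator state can be inspected at each `(Pdb)` prompt with `p gpu_mem()`. This is an assumption about how one would probe it, not part of the original script:

```python
import torch

def gpu_mem():
    """Call with `p gpu_mem()` at a (Pdb) prompt to inspect the allocator."""
    return {
        "allocated_GiB": torch.cuda.memory_allocated() / 2**30,
        "reserved_GiB": torch.cuda.memory_reserved() / 2**30,
        "max_allocated_GiB": torch.cuda.max_memory_allocated() / 2**30,
    }
```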
Deleting the model is not an option for me.
Yes, thanks. I think I'll try to watch the memory usage over time by running inference in a for loop or something.
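Such a loop test might look like the sketch below (the model name and iteration count are placeholders): if allocated memory keeps climbing across iterations rather than plateauing, that points to a genuine leak:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, I am a ", return_tensors="pt").to("cuda")

for step in range(50):
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100)
    # A plateau after the first few iterations is the caching allocator reaching
    # steady state; monotonic growth across all iterations suggests a real leak.
    print(f"step {step}: allocated={torch.cuda.memory_allocated() / 2**30:.3f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**30:.3f} GiB")
```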
What @ydshieh said, and more: to track real memory usage / debug potential leaks, always: […]
But don't do any of the above in production.
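Following that advice, a debugging-only helper might look like this sketch (the function name is made up; the `gc.collect()`/`empty_cache()` calls before reading are the part that matters):

```python
import gc
import torch

def debug_gpu_usage() -> float:
    # Debugging only: force Python GC and flush the CUDA cache so the reading
    # reflects live tensors, not cached blocks. empty_cache() slows down later
    # allocations, which is why none of this belongs in production code.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated() / 2**30
```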
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is not an issue anymore. Thanks for helping, guys.
@mayank31398 Would you like to share what worked for you :-) 🙏
I converted my server to Flask and ran it with gunicorn with 1 worker.
System Info
Information
Tasks
- A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
Script: https://github.com/mayank31398/Megatron-DeepSpeed/blob/add-generation-server/scripts/inference/bloom-accelerate-server.py

Usage:

```
python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT
```

Memory blowup over time discussed here: bigscience-workshop/Megatron-DeepSpeed#308 (comment)
Expected behavior