
OOM Error when deploying BLOOM-3B on 16GB GPU via MII #103

Closed
marshmellow77 opened this issue Nov 17, 2022 · 6 comments


marshmellow77 commented Nov 17, 2022

When deploying bigscience/bloom-3b (in fp32) via MII on a T4 GPU, I get a CUDA out-of-memory error; see this notebook. Deploying the same model (also in fp32) via the standard HF Pipeline API works; see this notebook.

My expectation is that if I can deploy a model via HF Pipelines, I should also be able to deploy it via MII. If that is not possible, it would be good to explain why and set expectations with users.

  • 1x T4, 16 GB GPU memory
  • deepspeed-mii version 0.0.3
  • transformers version 4.24.0
  • Amazon Linux 2
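
For context, here is a minimal sketch of the two code paths being compared (the linked notebooks are the actual repros; the deployment name, prompt, and config values below are illustrative):

```python
import mii
from transformers import pipeline

# Path 1: DeepSpeed-MII deployment -- the path that OOMs on a 16GB T4 in fp32.
mii.deploy(
    task="text-generation",
    model="bigscience/bloom-3b",
    deployment_name="bloom3b-deployment",  # illustrative name
    mii_config={"dtype": "fp32", "tensor_parallel": 1},
)
generator = mii.mii_query_handle("bloom3b-deployment")
print(generator.query({"query": ["DeepSpeed is"]}))

# Path 2: plain HF pipeline -- this works on the same GPU in fp32.
pipe = pipeline("text-generation", model="bigscience/bloom-3b", device=0)
print(pipe("DeepSpeed is"))
```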
@Tianwei-She

I'm having the same issue. Would like to know the answer too.

@mrwyattii
Contributor

I can confirm that I'm able to reproduce this on an A6000 as well. With MII, VRAM usage is ~18 GB; with transformers.pipeline, it is ~12 GB.

The difference is unexpectedly large, and we are investigating the cause. I'll also note that this is not the case for all models; I tested a few others, and many show the same memory usage under both.
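
For anyone who wants to reproduce the comparison on their own hardware, a minimal sketch of measuring peak VRAM for the transformers.pipeline path with PyTorch (prompt and generation length are arbitrary; the MII side is easier to read off nvidia-smi, since the model runs in a separate server process):

```python
import torch
from transformers import pipeline

torch.cuda.reset_peak_memory_stats()

# Load and run the model the same way the pipeline notebook does.
pipe = pipeline("text-generation", model="bigscience/bloom-3b", device=0)
pipe("DeepSpeed is", max_new_tokens=20)

# Peak GPU memory allocated by PyTorch in this process, in GB.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"transformers.pipeline peak VRAM: ~{peak_gb:.1f} GB")
```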

@mrwyattii
Contributor

@marshmellow77 it appears that the OOM you are seeing when using MII is due to the extra VRAM needed when injecting kernels with DeepSpeed-Inference. This can be avoided by loading the model into system memory rather than GPU memory before DeepSpeed-Inference runs. #105 adds an option that allows users to do this. Could you give it a try and let me know the results?

Install this version of MII:
pip install git+https://github.com/microsoft/deepspeed-mii@mrwyattii/address-poor-vram-usage

and add the following to your mii_configs: "load_with_sys_mem": True
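
For reference, a minimal sketch of what the deployment call might look like with the new flag (the deployment name and the other config values are illustrative; only load_with_sys_mem comes from #105):

```python
import mii

# Illustrative config; "load_with_sys_mem" stages the weights in CPU RAM
# before DeepSpeed-Inference kernel injection instead of loading them onto the GPU.
mii_configs = {
    "dtype": "fp32",
    "tensor_parallel": 1,
    "load_with_sys_mem": True,
}

mii.deploy(
    task="text-generation",
    model="bigscience/bloom-3b",
    deployment_name="bloom3b-deployment",
    mii_config=mii_configs,
)
```

Once the server is up, queries go through the usual mii.mii_query_handle("bloom3b-deployment") handle as before.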

@marshmellow77
Author

I can confirm that the model now loads onto the GPU. I can't test text generation because of #102, but I believe this issue can be closed.


satpalsr commented May 23, 2023

@mrwyattii Can the extra GPU memory used when load_with_sys_mem is False not be released later?
For smaller models that would at least leave more free memory.

@mrwyattii
Contributor

> @mrwyattii Can the extra GPU memory used when load_with_sys_mem is False not be released later? For smaller models that would at least leave more free memory.

DeepSpeed-Inference will release the extra memory after kernel injection happens.
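
If anyone wants to verify that the memory is actually returned on their own setup, a minimal sketch of watching GPU usage from outside the MII server process with pynvml (device index and polling interval are assumptions about your environment):

```python
import time
import pynvml

# Poll GPU 0's used memory while the MII server loads the model and injects kernels.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(30):
    used_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**3
    print(f"GPU 0 used: {used_gb:.1f} GB")
    time.sleep(10)
pynvml.nvmlShutdown()
```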
