
OOM Error when deploying BLOOM-3B on 16GB GPU via MII #103

Closed
marshmellow77 opened this issue Nov 17, 2022 · 6 comments


marshmellow77 commented Nov 17, 2022

When deploying bigscience/bloom-3b (in fp32) via MII on a T4 GPU, I get a CUDA out-of-memory error; see this notebook. Deploying the same model (also in fp32) via the standard HF Pipeline API works; see this notebook.

My expectation is that if I can deploy a model via HF Pipelines, I should also be able to deploy it via MII. If that is not possible, it would be good to explain why and set expectations with users.

  • 1x T4, 16 GB GPU memory
  • deepspeed-mii version 0.0.3
  • transformers version 4.24.0
  • Amazon Linux 2
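
For context, here is a minimal sketch of the two code paths being compared (the linked notebooks are the actual repros; the deployment name, prompt, and config values below are illustrative):

```python
import mii
from transformers import pipeline

# Path 1: DeepSpeed-MII deployment -- the path that OOMs on a 16GB T4 in fp32.
mii.deploy(
    task="text-generation",
    model="bigscience/bloom-3b",
    deployment_name="bloom3b-deployment",  # illustrative name
    mii_config={"dtype": "fp32", "tensor_parallel": 1},
)
generator = mii.mii_query_handle("bloom3b-deployment")
print(generator.query({"query": ["DeepSpeed is"]}))

# Path 2: plain HF pipeline -- this works on the same GPU in fp32.
pipe = pipeline("text-generation", model="bigscience/bloom-3b", device=0)
print(pipe("DeepSpeed is"))
```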
@Tianwei-She

I'm having the same issue. Would like to know the answer too.

@mrwyattii
Contributor

I can confirm that I'm able to reproduce this on an A6000 as well. With MII, VRAM usage is ~18 GB; with transformers.pipeline, it is ~12 GB.

The difference is unexpectedly large, and we are investigating the cause. I'll also note that this is not the case for all models; I tested a few others, and many show the same memory usage under both.
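
For anyone who wants to reproduce the comparison on their own hardware, a minimal sketch of measuring peak VRAM for the transformers.pipeline path with PyTorch (prompt and generation length are arbitrary; the MII side is easier to read off nvidia-smi, since the model runs in a separate server process):

```python
import torch
from transformers import pipeline

torch.cuda.reset_peak_memory_stats()

# Load and run the model the same way the pipeline notebook does.
pipe = pipeline("text-generation", model="bigscience/bloom-3b", device=0)
pipe("DeepSpeed is", max_new_tokens=20)

# Peak GPU memory allocated by PyTorch in this process, in GB.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"transformers.pipeline peak VRAM: ~{peak_gb:.1f} GB")
```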

@mrwyattii
Contributor

@marshmellow77 it appears that the OOM you are seeing when using MII is due to the extra VRAM needed when injecting kernels with DeepSpeed-Inference. This can be avoided by loading the model into system memory rather than GPU memory before DeepSpeed-Inference runs. #105 adds an option that allows users to do this. Could you give it a try and let me know the results?

Install this version of MII:
pip install git+https://github.com/microsoft/deepspeed-mii@mrwyattii/address-poor-vram-usage

and add the following to your mii_configs: "load_with_sys_mem": True
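
For reference, a minimal sketch of what the deployment call might look like with the new flag (the deployment name and the other config values are illustrative; only load_with_sys_mem comes from #105):

```python
import mii

# Illustrative config; "load_with_sys_mem" stages the weights in CPU RAM
# before DeepSpeed-Inference kernel injection instead of loading them onto the GPU.
mii_configs = {
    "dtype": "fp32",
    "tensor_parallel": 1,
    "load_with_sys_mem": True,
}

mii.deploy(
    task="text-generation",
    model="bigscience/bloom-3b",
    deployment_name="bloom3b-deployment",
    mii_config=mii_configs,
)
```

Once the server is up, queries go through the usual mii.mii_query_handle("bloom3b-deployment") handle as before.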

@marshmellow77
Author

I can confirm that the model now loads onto the GPU. I can't test text generation because of #102, but I believe this issue can be closed.


satpalsr commented May 23, 2023

@mrwyattii Can the extra GPU memory used when load_with_sys_mem is False not be released later?
For smaller models that would at least leave more free memory.

@mrwyattii
Contributor

> @mrwyattii Can the extra GPU memory used when load_with_sys_mem is False not be released later? For smaller models that would at least leave more free memory.

DeepSpeed-Inference will release the extra memory after kernel injection happens.
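
If anyone wants to verify that the memory is actually returned on their own setup, a minimal sketch of watching GPU usage from outside the MII server process with pynvml (device index and polling interval are assumptions about your environment):

```python
import time
import pynvml

# Poll GPU 0's used memory while the MII server loads the model and injects kernels.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(30):
    used_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**3
    print(f"GPU 0 used: {used_gb:.1f} GB")
    time.sleep(10)
pynvml.nvmlShutdown()
```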
