
OPT in TP or PP mode #71

Closed
volkerha opened this issue Oct 17, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@volkerha

Is there a way to run inference on OPT models in tensor-parallel (TP) or pipeline-parallel (PP) mode?

As I understand:

  • BLOOM uses the llm provider, which loads the model weights as meta tensors first and then assigns devices during checkpoint loading in ds-inference.

  • OPT uses the hf provider with the 🤗 pipeline and loads checkpoint weights directly onto a specific device.

However, only MP is supported on the 🤗 side (via accelerate). Is there a way to run inference on OPT with the llm provider?

@mrwyattii
Contributor

Hi @volkerha, thanks for using MII! If you take a look here, you'll see that regardless of the provider, the models are processed by the DeepSpeed Inference Engine. This allows any of the models to be run on multi-GPU setups (using TP). To enable this, just add "tensor_parallel": 2 to the mii_config dict passed to mii.deploy(). Some of our examples demonstrate this: https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/text-generation-bloom-example.py

@mrwyattii
Contributor

Here's an example for an OPT model that I just tested on 2 GPUs:

import mii

mii_config = {"dtype": "fp16", "tensor_parallel": 2}
name = "facebook/opt-1.3b"

mii.deploy(
    task="text-generation",
    model=name,
    deployment_name=name + "_deployment",
    mii_config=mii_config,
)
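
A deployment like this can then be queried from a client script. The snippet below is a minimal sketch based on the bloom example linked above; the deployment name must match the one passed to mii.deploy(), and the prompts and generation kwargs are just illustrative:

import mii

# Connect to the deployment started above (same deployment_name).
generator = mii.mii_query_handle("facebook/opt-1.3b_deployment")

# Send a text-generation request; the sampling settings here are arbitrary.
result = generator.query(
    {"query": ["DeepSpeed is", "Seattle is"]},
    do_sample=True,
    max_new_tokens=30,
)
print(result)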

@volkerha
Author

I tested facebook/opt-6.7b on 8 GPUs with TP=8, FP16. It takes around 28GB per GPU, which looks like it's loading the full model parameters (6.7B * 4 bytes ~= 27GB) on every GPU in FP32 (maybe because fp16 is only applied after model loading?).
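
As a rough back-of-the-envelope check (weights only, ignoring activations and kernel workspace), the observed footprint matches a full fp32 copy per GPU rather than an fp16 shard:

# Approximate per-GPU weight memory for facebook/opt-6.7b with TP=8.
params = 6.7e9  # parameter count
tp = 8          # tensor-parallel degree

full_fp32_copy = params * 4 / 1e9       # ~26.8 GB if every GPU holds the whole model in fp32
fp16_shard     = params * 2 / tp / 1e9  # ~1.7 GB if fp16 weights are split across 8 GPUs

print(f"full fp32 copy per GPU: {full_fp32_copy:.1f} GB")
print(f"fp16 shard per GPU:     {fp16_shard:.1f} GB")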

@mrwyattii
Contributor

@volkerha you are correct; currently, with the huggingface provider, we load the full model onto each GPU here. Once we call deepspeed.init_inference on this line, the model gets split across multiple GPUs.

I can see how this would be problematic if you don't have enough memory to load the full model on each GPU. We have a workaround that uses meta-tensors (like with the llm provider), but I don't think it's compatible with how we load other huggingface models. @jeffra thoughts on this?
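
For reference, the meta-tensor path looks roughly like the following. This is a sketch of the BLOOM-style ds-inference flow rather than the exact MII code path; the checkpoint json path is a placeholder you would need to generate for your own checkpoint files:

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-6.7b")

# Instantiate the model on the meta device: no weights are materialized yet.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# init_inference loads only each rank's shard of the checkpoint onto its GPU.
model = deepspeed.init_inference(
    model,
    mp_size=8,                       # tensor-parallel degree
    dtype=torch.float16,
    checkpoint="checkpoints.json",   # placeholder: json describing the checkpoint files
    replace_with_kernel_inject=True,
)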

@Tianwei-She

Tianwei-She commented Nov 16, 2022

@mrwyattii Hi, I'm having CUDA OOM errors when loading an EleutherAI/gpt-neox-20b model onto 8 GPUs with TP=8, FP16. Each GPU has 23GB. Is this expected? And does this mean I should use the meta-tensor workaround you mentioned above to load this model? Thanks!

@mrwyattii
Contributor

@Tianwei-She I responded to your other issue with a solution (#99).

@volkerha I've made some changes to how we load models in #105. This doesn't completely address the issue of needing to load multiple copies of a model when using tensor parallelism, but we do have plans to address this further. I'll leave this issue open for now and file it under "Enhancement".

@mrwyattii mrwyattii added the enhancement New feature or request label Nov 21, 2022
@mrwyattii
Contributor

#199 adds support for loading models other than BLOOM (including GPT-NeoX, GPT-J, and OPT) using meta tensors. This resolves the problem of loading the model into memory multiple times.
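
Assuming the meta-tensor path is toggled through mii_config, usage would look roughly like the sketch below (the "meta_tensor" flag name here is an assumption; check #199 for the exact option):

import mii

# "meta_tensor" is an assumed flag name from #199; see the PR for the exact option.
mii_config = {"dtype": "fp16", "tensor_parallel": 8, "meta_tensor": True}

mii.deploy(
    task="text-generation",
    model="facebook/opt-6.7b",
    deployment_name="opt-6.7b_deployment",
    mii_config=mii_config,
)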

@volkerha volkerha closed this as completed Jun 2, 2023