OPT in TP or PP mode #71
Hi @volkerha, thanks for using MII! If you take a look here, you'll see that regardless of the provider, the models are processed by the DeepSpeed Inference engine. This allows any of the models to be run on multi-GPU setups (using TP). To enable this, just add `tensor_parallel` to your `mii_config`.
Here's an example for an OPT model that I just tested on 2 GPUs:

```python
import mii

mii_config = {"dtype": "fp16", "tensor_parallel": 2}
name = "facebook/opt-1.3b"

mii.deploy(
    task="text-generation",
    model=name,
    deployment_name=name + "_deployment",
    mii_config=mii_config,
)
```
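Once the deployment is up, you can query it from any process. A minimal sketch of the query side using MII's `mii_query_handle` API (the prompt text and generation kwargs here are just illustrative):

```python
import mii

# Connect to the deployment created above
generator = mii.mii_query_handle("facebook/opt-1.3b_deployment")

# Send a text-generation request; extra kwargs are forwarded to the model
result = generator.query(
    {"query": ["DeepSpeed is"]},
    do_sample=True,
    max_new_tokens=30,
)
print(result)
```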
I tested this, and it looks like the full model weights are still loaded onto each GPU before they are partitioned.
@volkerha you are correct, currently with the huggingface provider we load the full model onto each GPU here. Once we call into the DeepSpeed Inference engine the weights are partitioned, but I can see how this would be problematic if you don't have enough memory to load the full model on each GPU. We have a workaround that uses meta tensors (like with the llm provider), but I don't think it's compatible with how we load other huggingface models. @jeffra thoughts on this?
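For readers unfamiliar with the meta-tensor workaround mentioned above, here is a minimal sketch of the general idea (an illustration, not MII's actual loading code; it assumes PyTorch >= 2.0, where `torch.device("meta")` works as a context manager):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-1.3b")

# Build the model skeleton on the "meta" device: parameter shapes and dtypes
# exist, but no memory is allocated for the weights on any device.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

# Real weights are attached later, e.g. by the inference engine loading only
# this rank's partition of the checkpoint onto its GPU, so no rank ever
# materializes a full copy of the model.
```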
@mrwyattii Hi, I'm running into CUDA OOM errors when loading a large model.
@Tianwei-She I responded to your other issue with a solution (#99). @volkerha I've made some changes to how we load models in #105. This doesn't completely address the issue of needing to load multiple copies of a model when using tensor parallelism, but we do have plans to address this further. I'll leave this issue open for now and file it under "Enhancement".
#199 adds support for loading models other than BLOOM (including GPT-NeoX, GPT-J, and OPT) using meta tensors. This resolves the problem of loading the model into memory multiple times.
Is there a way to run inference on OPT models in tensor-parallel (TP) or pipeline-parallel (PP) mode?
As I understand it:
- BLOOM uses the llm provider, which first loads the model weights as meta tensors and then assigns devices during checkpoint loading in ds-inference.
- OPT uses the hf provider with the 🤗 pipeline, which loads the checkpoint weights directly onto a specific device.

However, only model parallelism (MP) is supported on the 🤗 side (using accelerate; see the sketch below). Is there a way to run inference on OPT with the llm provider?
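For context, the 🤗/accelerate model parallelism referred to above splits a model's submodules across the visible devices via `device_map` (naive pipeline-style placement, not tensor parallelism). A minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# With accelerate installed, device_map="auto" places layers across all
# visible GPUs (and CPU, if needed), so the full model need not fit on one GPU.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
```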