
[BUG] I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU #3463

Open
AbhayGoyal opened this issue May 5, 2023 · 15 comments
Labels: bug (Something isn't working), inference

@AbhayGoyal

Describe the bug
I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU, but it still does not work. I tried parallelizing it over 4 GPUs as well, and it shows me a SIGKILL.

To Reproduce
Here is the code I ran:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)


@AbhayGoyal AbhayGoyal added bug Something isn't working inference labels May 5, 2023
@mrwyattii mrwyattii self-assigned this May 5, 2023
@mrwyattii
Contributor

@AbhayGoyal you need to specify the device in pipeline. If you don't do this, the tokenizer will be on CPU and the model will be on GPU, resulting in the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Here is an updated version of your script that should work:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
device = torch.device(f"cuda:{local_rank}")
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B", device=device)

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.float,
    replace_with_kernel_inject=True,
)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
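For completeness, a script like this is normally run with the DeepSpeed launcher so that LOCAL_RANK and WORLD_SIZE are set for each process; assuming the file is saved as run_inference.py (the file name here is just a placeholder), the launch would look something like:

deepspeed --num_gpus 4 run_inference.py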

@AbhayGoyal
Author

AbhayGoyal commented May 5, 2023 via email

@AbhayGoyal
Author

I tried the solution you gave. It still gives me the exact same error

@mrwyattii
Contributor

@AbhayGoyal can you please share the error message you are seeing? Is it an Out Of Memory error?

@AbhayGoyal
Author

AbhayGoyal commented May 8, 2023 via email

@AbhayGoyal
Author

AbhayGoyal commented May 9, 2023 via email

@mrwyattii
Contributor

https://github.com/microsoft/DeepSpeedExamples/blob/8e4ec02c1545f7bd87d3bfe5daaafa5a5f1fe6a6/inference/huggingface/text-generation/inference-test.py

(Quoting Abhay Goyal's May 8, 2023 email reply: "Actually it turns out that if I run it on just 1 GPU, it works well. Let me send the code here.")

What are the exact command line arguments you are using to launch the script? If you can run on a single GPU, it should run on multiple GPUs as well. Please ensure you are using --ds_inference and --use_kernel when you run this script!
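For reference, a launch along those lines would look roughly like the following (the model name is only an example; the flags come from the linked inference-test.py script):

deepspeed --num_gpus 4 inference-test.py --name EleutherAI/gpt-neo-2.7B --ds_inference --use_kernel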

@AbhayGoyal
Author

AbhayGoyal commented May 9, 2023 via email

@karandua2016

@AbhayGoyal I was facing the same issue on V100. In my case the process crashed with SIGKILL when it ran out of system RAM. The reason is that the model is first loaded on the CPU and then moved to the GPU by DeepSpeed, so if you run the script with more than one GPU, DeepSpeed loads multiple instances of the model and may exceed system memory.
Can you check the amount of RAM (system RAM, not GPU RAM) available? You should run the inference script and then monitor the RAM using "free -s2 -g".
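If it helps, here is a minimal Python sketch for checking available system RAM from inside the launch environment; it assumes the psutil package is installed, which is not something the comment above mentions:

# Minimal sketch: report available system RAM before the model is loaded.
# Assumes psutil is installed (pip install psutil).
import psutil

mem = psutil.virtual_memory()
print(f"Available system RAM: {mem.available / 1024**3:.1f} GiB "
      f"of {mem.total / 1024**3:.1f} GiB total")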

@AbhayGoyal
Author

AbhayGoyal commented May 19, 2023 via email

@KMFODA
Contributor

KMFODA commented Jun 15, 2023

Hi all, I'm facing the same issue here. I was wondering whether anyone has any ideas about what might be causing this.

I'm trying to run inference on a model that needs a minimum of 2 A100 GPUs, using

/opt/conda/bin/deepspeed /root/DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --num_gpus 2 --name huggyllama/llama-65b

and I'm getting the SIGKILL error:

[2023-06-15 15:32:36,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 18390
[2023-06-15 15:32:43,064] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 18391

even though, in theory, the model should fit on 2 A100 GPUs and generate results using DeepSpeed.

@abmybgx

abmybgx commented Aug 15, 2023

Same issue on 8 × A100; marking to follow.

@zzkcaesar

Hi, I have encountered the same error on 8 × H800 GPUs. Is there any solution for this?

@egesko

egesko commented Aug 26, 2023

Same error with 4 × RTX A5000 GPUs.

@mrwyattii
Contributor

Hi All, we have recently made some updates that affect this issue. Please install the latest DeepSpeed and use the latest scripts from https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py

You can now load models using meta tensors to avoid using all the system memory and causing these errors. This works for most models when using Auto Tensor Parallelism (i.e., when not using --use_kernel) and it works for GPT-NEO, BLOOM, OPT, and GPT-J models when using kernel injection (i.e., when using --use_kernel):

deepspeed --num_gpus 2 inference-test.py --model huggyllama/llama-65b --use_meta_tensor
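For context, the meta-tensor path builds the model skeleton on the "meta" device so the weights are never materialized in system RAM. A simplified sketch of the idea (the checkpoint loading that init_inference then performs is omitted, and the model name is only an example):

# Rough sketch of meta-tensor loading: construct the model with empty "meta"
# weights so nothing is materialized in system RAM; DeepSpeed then loads the
# real checkpoint shards directly onto the GPUs during init_inference.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-neo-2.7B")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# init_inference must then be given the real checkpoint files to load; see how
# inference-test.py handles --use_meta_tensor for the full flow.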
