
Loading an HF Model onto Multiple GPUs and Running Inference Using deepspeed.init_inference() on Those GPUs #4145

Open
asifehmad opened this issue Aug 14, 2023 · 8 comments

Comments

@asifehmad

Hi,
I loaded this model from HF using the below code on 2xA100s:

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map='balanced')

Then I loaded it using deepspeed.init_inference() as:

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=2,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True)

and then for inference I used:

import time
start = time.time()
prompt = 'What is the Capital of France? '
inputs = tokenizer.encode(f"<human>: {prompt} \n<bot>:", return_tensors='pt').to(model.device)
outputs = ds_model.generate(inputs, max_new_tokens=500)
output_str = tokenizer.decode(outputs[0])
print(output_str)
end = time.time()
print('Inference Time is:', end - start)

Unfortunately, I got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 116, in __init__
    self._create_model_parallel_group(config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 222, in _create_model_parallel_group
    self.mp_group = dist.new_group(ranks)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 185, in new_group
    return cdb.new_group(ranks)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 325, in new_group
    return torch.distributed.new_group(ranks)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3313, in new_group
    raise RuntimeError(
RuntimeError: the new group's world size should be less or equal to the world size set by init_process_group

I tried setting mp_size to 1 as well, but to no avail.

Any guidance/help would be highly appreciated, thanks in anticipation!

@nadimintikrish

try setting this up
os.getenv('WORLD_SIZE', '1')
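For context, the RuntimeError above means torch.distributed was initialized with a world size of 1 (the script was started with plain python), so DeepSpeed cannot create a model-parallel group of size 2. A minimal sketch of what reading the launcher-provided variables could look like, assuming the script is started with the deepspeed launcher so they are actually populated:

import os

# Exported per process by the deepspeed launcher, e.g.
#   deepspeed --num_gpus 2 model.py
# The fallbacks below apply only when the script is run with plain `python`,
# in which case torch.distributed only sees a world size of 1.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# ...then pass world_size through, e.g. mp_size=world_size in
# deepspeed.init_inference(...), so the requested group never exceeds the
# world size set by the launcher.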

@asifehmad
Author

asifehmad commented Aug 15, 2023

try setting this up os.getenv('WORLD_SIZE', '1')

Hi, I used it like this:

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed

world_size = os.getenv('WORLD_SIZE', '2')
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map='balanced')

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True)


import time
start = time.time()
prompt = 'What is the Capital of France? '
inputs = tokenizer.encode(f"<human>: {prompt} \n<bot>:", return_tensors='pt').to(model.device)
outputs = ds_model.generate(inputs, max_new_tokens=256)
output_str = tokenizer.decode(outputs[0])
print(output_str)
end = time.time()
print('Inference Time is:', end - start)

I put the code in model.py and ran it using deepspeed --num_gpus 2 model.py but got this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
[2023-08-15 09:03:03,284] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3411
[2023-08-15 09:03:03,581] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3412
[2023-08-15 09:03:03,582] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', 'model.py', '--local_rank=1'] exits with return code = 1

I tried setting world_size = os.getenv('WORLD_SIZE', '1') too, but got the same error as above.

@asifehmad
Author

asifehmad commented Aug 15, 2023

try setting this up os.getenv('WORLD_SIZE', '1')

Okay, so it works now with this script:

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed
import time

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map=local_rank)

ds_model = deepspeed.init_inference(
    model=model, mp_size=world_size, dtype=torch.float16, replace_method="auto", replace_with_kernel_inject=True)

start = time.time()
prompt = 'Write a detailed note on AI'
inputs = tokenizer.encode(f"<human>: {prompt} \n<bot>:", return_tensors='pt').to(model.device)
outputs = ds_model.generate(inputs, max_new_tokens=2000)
output_str = tokenizer.decode(outputs[0])
print(output_str)
end = time.time()
print('Inference Time is:', end - start)
                                        

It is giving me two responses instead of one; this is a new problem I am facing.
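A note on the duplicated output: with deepspeed --num_gpus 2 both ranks execute the whole script, so both call generate() and both print. A small sketch of keeping the collective generate() call on every rank but printing only from rank 0, reusing the local_rank already read above:

outputs = ds_model.generate(inputs, max_new_tokens=2000)  # must run on every rank
if local_rank == 0:
    # Only rank 0 prints, so a single response is shown.
    print(tokenizer.decode(outputs[0]))
    print('Inference Time is:', time.time() - start)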

@thedaffodil

thedaffodil commented Aug 15, 2023

Hello again, I came here from discuss.huggingface :)

How can I modify this code to use this solution?
Which part should I change to run it on 2 GPUs?

from huggingface_hub import hf_hub_download
import torch
import os
from open_flamingo import create_model_and_transforms
from accelerate import Accelerator
from einops import repeat
from PIL import Image
import sys
sys.path.append('..')
from src.utils import FlamingoProcessor
from demo_utils import image_paths, clean_generation


def main():
    accelerator = Accelerator() #when using cpu: cpu=True

    device = accelerator.device
    
    print('Loading model..')

    # >>> add your local path to Llama-7B (v1) model here:
    llama_path = '../models/llama-7b-hf'
    if not os.path.exists(llama_path):
        raise ValueError('Llama model not yet set up, please check README for instructions!')

    model, image_processor, tokenizer = create_model_and_transforms(
        clip_vision_encoder_path="ViT-L-14",
        clip_vision_encoder_pretrained="openai",
        lang_encoder_path=llama_path,
        tokenizer_path=llama_path,
        cross_attn_every_n_layers=4
    )
    # load med-flamingo checkpoint:
    checkpoint_path = hf_hub_download("med-flamingo/med-flamingo", "model.pt")
    print(f'Downloaded Med-Flamingo checkpoint to {checkpoint_path}')
    model.load_state_dict(torch.load(checkpoint_path, map_location=device), strict=False)
    processor = FlamingoProcessor(tokenizer, image_processor)

    # go into eval model and prepare:
    model = accelerator.prepare(model)
    is_main_process = accelerator.is_main_process
    model.eval()

    """
    Step 1: Load images
    """
    demo_images = [Image.open(path) for path in image_paths]

    """
    Step 2: Define multimodal few-shot prompt 
    """

    # example few-shot prompt:
    prompt = "You are a helpful medical assistant. You are being provided with images, a question about the image and an answer. Follow the examples and answer the last question. <image>Question: What is/are the structure near/in the middle of the brain? Answer: pons.<|endofchunk|><image>Question: Is there evidence of a right apical pneumothorax on this chest x-ray? Answer: yes.<|endofchunk|><image>Question: Is/Are there air in the patient's peritoneal cavity? Answer: no.<|endofchunk|><image>Question: Does the heart appear enlarged? Answer: yes.<|endofchunk|><image>Question: What side are the infarcts located? Answer: bilateral.<|endofchunk|><image>Question: Which image modality is this? Answer: mr flair.<|endofchunk|><image>Question: Where is the largest mass located in the cerebellum? Answer:"

    """
    Step 3: Preprocess data 
    """
    print('Preprocess data')
    pixels = processor.preprocess_images(demo_images)
    pixels = repeat(pixels, 'N c h w -> b N T c h w', b=1, T=1)
    tokenized_data = processor.encode_text(prompt)

    """
    Step 4: Generate response 
    """

    # actually run few-shot prompt through model:
    print('Generate from multimodal few-shot prompt')
    generated_text = model.generate(
    vision_x=pixels.to(device),
    lang_x=tokenized_data["input_ids"].to(device),
    attention_mask=tokenized_data["attention_mask"].to(device),
    max_new_tokens=10,
    )
    response = processor.tokenizer.decode(generated_text[0])
    response = clean_generation(response)

    print(f'{response=}')

if __name__ == "__main__":

    main()
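Not the DeepSpeed route from this thread, but one way to spread this particular model over 2 GPUs is Accelerate's big-model dispatch instead of accelerator.prepare(model). A rough sketch only: the max_memory budgets are placeholders, and open_flamingo block classes may need to be listed in no_split_module_classes.

from accelerate import infer_auto_device_map, dispatch_model

# Replace `model = accelerator.prepare(model)` with a device map that splits
# the single model across both GPUs; hooks inserted by dispatch_model move
# activations between devices during generate().
device_map = infer_auto_device_map(
    model,
    max_memory={0: "20GiB", 1: "20GiB"},  # placeholder per-GPU budgets
)
model = dispatch_model(model, device_map=device_map)
model.eval()

# Keep the inputs on the first device; dispatch_model handles the rest.
device = "cuda:0"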

@asifehmad
Author

Hello again, I came here from discuss.huggingface :)

How can I modify this code to use this solution? Which part should I change to run it on 2 GPUs?


Hi, can you write your code again in the proper format?

@thedaffodil

Hi, can you write your code again in the proper format?

Hello, I've just edited my comment to write code in the proper format.

@pengyulong

pengyulong commented Oct 22, 2023

Hi, I am running into the same problem; this is my code:

from transformers import FuyuForCausalLM, AutoTokenizer, FuyuProcessor, FuyuImageProcessor
from PIL import Image
import os
from accelerate import Accelerator 
import deepspeed
import time
pretrained_path = "./fuyu-8b"
accelerator = Accelerator()
device = accelerator.device
tokenizer = AutoTokenizer.from_pretrained(pretrained_path,use_fast=True,device_map="auto")

image_processor = FuyuImageProcessor()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)

model = FuyuForCausalLM.from_pretrained(pretrained_path)

model = accelerator.prepare(model)
is_main_process = accelerator.is_local_main_process
model.eval()
text_prompt = "Generate a coco-style caption.\n"
image_path = "./1163433113_img.jpg"  # https://huggingface.co/adept-hf-collab/fuyu-8b/blob/main/bus.png
image_pil = Image.open(image_path)

model_inputs = processor(text=text_prompt, images=[image_pil], device=device)
for k, v in model_inputs.items():
    model_inputs[k] = v.to(device)

generation_output = model.generate(**model_inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)

My machine contains 4 GPUs (A10), but I still got this error (attached as a screenshot).
Please help me.
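Hard to tell without the text of the screenshotted error, but the pattern that ended up working earlier in this thread (load the full model onto each rank's own GPU, then let DeepSpeed build the tensor-parallel group) might carry over. A rough sketch under that assumption, with the further caveat that Fuyu may not be covered by kernel injection, so it is left disabled here:

import os
import torch
import deepspeed
from transformers import FuyuForCausalLM, FuyuProcessor, FuyuImageProcessor, AutoTokenizer

# Launch with: deepspeed --num_gpus 4 script.py
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

pretrained_path = "./fuyu-8b"
tokenizer = AutoTokenizer.from_pretrained(pretrained_path, use_fast=True)
processor = FuyuProcessor(image_processor=FuyuImageProcessor(), tokenizer=tokenizer)

# Put the whole model on this rank's GPU instead of using accelerator.prepare().
model = FuyuForCausalLM.from_pretrained(
    pretrained_path, torch_dtype=torch.float16).to(f"cuda:{local_rank}")

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=False)  # assumption: no injection policy for Fuyu

# The processor outputs then also go to f"cuda:{local_rank}" before
# calling ds_model.generate(...).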

@ArlanCooper

try setting this up os.getenv('WORLD_SIZE', '1')

Actually, I have 4 GPUs on the machine; why can I only use one?
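If it helps, os.getenv('WORLD_SIZE', '1') only reads what the launcher exports; the '1' is merely a fallback. The number of GPUs actually used comes from how the script is launched, roughly like this:

# deepspeed --num_gpus 4 script.py   -> WORLD_SIZE=4, one process per GPU
# python script.py                   -> variables unset, fallback of 1 GPU
import os
world_size = int(os.getenv('WORLD_SIZE', '1'))
local_rank = int(os.getenv('LOCAL_RANK', '0'))
print(f'rank {local_rank} of {world_size}')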
