
[Bug] DeepSpeed Inference Does Not Work with LLaMA (Latest version) #867

Open

allanj opened this issue Feb 29, 2024 · 3 comments

allanj commented Feb 29, 2024

Version

deepspeed: 0.13.4
transformers: 4.38.1
Python: 3.10
Pytorch: 2.1.2+cu121
CUDA: 12.1

Error in Example (To reproduce)

Simply run this script:
https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py

deepspeed --num_gpus 8 inference-test.py --model meta-llama/Llama-2-7b-hf --use_kernel

It will show the following error:

Traceback (most recent call last):
  File "/root/DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py", line 82, in <module>
    outputs = pipe(inputs,
  File "/root/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 71, in __call__
    outputs = self.generate_outputs(input_list, num_tokens=num_tokens, do_sample=do_sample)
  File "/root/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 119, in generate_outputs
    outputs = self.model.generate(input_tokens.input_ids, **generate_kwargs)
  File "/miniconda/envs/py310/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 636, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/miniconda/envs/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 2693, in sample
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/miniconda/envs/py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1256, in prepare_inputs_for_generation
    if past_key_value := getattr(self.model.layers[0].self_attn, "past_key_value", None):
  File "/miniconda/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedGPTInference' object has no attribute 'self_attn'

Potential bug?

I suspect the wrong inference module was injected: shouldn't it be DeepSpeedLlamaInference rather than DeepSpeedGPTInference?
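For reference, the kernel-injection path in that example script boils down to a deepspeed.init_inference call along these lines (a sketch; the argument wiring is paraphrased from the script, not copied verbatim):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load the Hugging Face model first, then hand it to DeepSpeed-Inference.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)

# With --use_kernel, DeepSpeed swaps each decoder layer for a fused-kernel
# container (the DeepSpeed*Inference module named in the traceback above),
# chosen by an auto-detected injection policy.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 8},   # matches --num_gpus 8
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # what --use_kernel toggles
)

If that auto-detected policy mapped LLaMA's layers to the GPT container, it would explain the AttributeError above.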

allanj commented Feb 29, 2024

It would be nice if someone could tell me which version works.

mrwyattii (Contributor) commented

Hi @allanj, I don't think we have kernel injection support for Llama-2 models. If you remove the --use_kernel flag, does the script work?
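In other words, the reproduction command from above without the kernel-injection flag:

deepspeed --num_gpus 8 inference-test.py --model meta-llama/Llama-2-7b-hf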

Additionally, what kind of GPUs are you using? You may be able to use DeepSpeed-MII to run the Llama-2 model and get significant improvements in inference performance if you have GPUs with compute capability >= 8.0:

import mii

# Start a persistent FastGen inference server, sharding the model across 8 GPUs.
client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=8)
# The client sends the prompt to the server and returns the generated text.
response = client("test prompt")
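When you are done, the server can be shut down from the same client; a one-liner, assuming the terminate_server method described in the MII README:

client.terminate_server()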

allanj commented Mar 2, 2024

Yes, removing the --use_kernel flag makes it work.

Yeah, I am aware of DeepSpeed FastGen. I am wondering, though, how does it handle batching? Or should I simply write a for loop over my prompts?
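For reference, the MII client also accepts a list of prompts in a single call, so a Python-side for loop should not be necessary; FastGen batches and schedules requests on the server. A minimal sketch (prompt strings are placeholders):

import mii

client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=8)
# Pass the whole batch in one call; FastGen schedules the requests together.
responses = client.generate(["prompt one", "prompt two"], max_new_tokens=128)
print(responses)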
