
Speed Issues with Local Inference of llama2-70B-chat Model #957

Closed
baifanxxx opened this issue Dec 8, 2023 · 2 comments
Labels: model-usage (issues related to how models are used/loaded), performance (runtime / memory / accuracy performance issues)

Comments

@baifanxxx

Hi there,

I hope this message finds you well. I am writing to report a performance issue I encountered while running the llama2-70B-chat model locally on an 8×A100 (80 GB) machine. After downloading and configuring the model using the provided download.sh script, I attempted to run the example_chat_completion.py script with the following command:

torchrun --nproc_per_node 8 example_chat_completion.py --ckpt_dir llama-2-70b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

However, I encountered a RuntimeError about an in-place update to an inference tensor outside of InferenceMode. Following the advice given in this GitHub issue, I replaced @torch.inference_mode() with @torch.no_grad() in model.py and generation.py. This resolved the initial error and allowed the model to run locally.
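
For reference, a minimal sketch of this kind of decorator swap on a toy module (the module, cache, and function names below are illustrative, not the actual Llama code; the real change is only to the decorators in generation.py and model.py):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model and its KV cache.
model = nn.Linear(8, 8)
kv_cache = torch.zeros(4, 8)

@torch.no_grad()                     # was: @torch.inference_mode()
def decode_step(x, pos):
    out = model(x)
    kv_cache[pos] = out[0]           # in-place cache write: allowed under no_grad,
    return out                       # whereas inference-mode tensors reject such updates

decode_step(torch.randn(1, 8), 0)
```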

Nevertheless, I noticed a significant discrepancy in inference speed between the local environment and the online version available at llama2.ai. Locally, the model takes approximately 5 minutes for each inference, while the online version provides almost real-time results.

I have a few questions and concerns:

  1. Performance Discrepancy: Is it reasonable to expect a difference in inference speed between local and online environments, or could there be an underlying issue with my local setup?

  2. Impact of @torch.no_grad(): Does replacing @torch.inference_mode() with @torch.no_grad() have any significant impact on the inference speed? Could it be a contributing factor to the observed slowdown?

  3. Hugging Face Models: Would using the Hugging Face version of the model result in faster inference speeds compared to the locally configured llama2-70B-chat model?

  4. Optimizations for Local Inference: Are there any specific optimizations or configurations, such as flash attention, that could be applied to improve the local inference speed? (See the sketch after this list.)
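
As a sketch of one such option, and assuming a recent PyTorch build, torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention kernel on A100s. The tensor shapes below are illustrative, not taken from the 70B configuration:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim); fp16 on a CUDA device
# is required for the flash kernel.
q = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.float16)

# Force the FlashAttention backend (raises if it is unsupported on this setup).
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```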

I appreciate your assistance in addressing these concerns and would be grateful for any guidance or recommendations to optimize the local performance of the llama2-70B-chat model.

Thank you for your time and attention to this matter.

Best regards,
BAI Fan

@subramen
Contributor

The cloud service is likely running a bunch of optimizations to speed up inference, especially quantization. You might wanna check out https://github.com/pytorch-labs/gpt-fast which showcases many such optimizations to speed up generation (benchmarks in the README).
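
For a rough idea of what weight-only int8 quantization buys you (the helper names below are illustrative, not gpt-fast's actual API):

```python
import torch

def quantize_int8(weight: torch.Tensor):
    # Per-output-channel symmetric int8 quantization of a weight matrix.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    # Dequantize on the fly; fused kernels avoid materializing the fp weights,
    # which is where the speedup comes from for memory-bound 70B decoding.
    return x @ (q.to(x.dtype) * scale).t()

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
x = torch.randn(1, 4096)
print((x @ w.t() - int8_linear(x, q, s)).abs().max())   # small quantization error
```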

subramen self-assigned this Dec 13, 2023
@jeffxtang
Contributor

@baifanxxx You may also try vLLM or TGI as shown in this tutorial.
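
A minimal vLLM sketch for comparison (the Hugging Face model id and tensor_parallel_size=8 below are assumptions for an 8×A100 node):

```python
from vllm import LLM, SamplingParams

# Shard the 70B chat model across 8 GPUs and generate with continuous batching.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```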

subramen closed this as completed Jan 3, 2024
subramen added the performance and model-usage labels Jan 3, 2024