
Speed Issues with Local Inference of llama2-70B-chat Model #957

Closed
baifanxxx opened this issue Dec 8, 2023 · 2 comments
Labels: model-usage (issues related to how models are used/loaded), performance (runtime / memory / accuracy performance issues)

Comments

@baifanxxx

Hi there,

I hope this message finds you well. I am writing to report a performance issue I encountered while running the llama2-70B-chat model locally on an 8×A100 (80 GB) machine. After downloading and configuring the model using the provided download.sh script, I attempted to run the example_chat_completion.py script with the following command:

torchrun --nproc_per_node 8 example_chat_completion.py --ckpt_dir llama-2-70b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

However, I encountered a RuntimeError about an in-place update to an inference tensor outside of InferenceMode. Following the advice given in this GitHub issue, I replaced @torch.inference_mode() with @torch.no_grad() in model.py and generation.py. This resolved the initial error and allowed the model to run locally.
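
For reference, a minimal sketch of this kind of decorator swap on a toy module (the module, cache, and function names below are illustrative, not the actual Llama code; the real change is only to the decorators in generation.py and model.py):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model and its KV cache.
model = nn.Linear(8, 8)
kv_cache = torch.zeros(4, 8)

@torch.no_grad()                     # was: @torch.inference_mode()
def decode_step(x, pos):
    out = model(x)
    kv_cache[pos] = out[0]           # in-place cache write: allowed under no_grad,
    return out                       # whereas inference-mode tensors reject such updates

decode_step(torch.randn(1, 8), 0)
```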

Nevertheless, I noticed a significant discrepancy in inference speed between the local environment and the online version available at llama2.ai. Locally, the model takes approximately 5 minutes for each inference, while the online version provides almost real-time results.

I have a few questions and concerns:

  1. Performance Discrepancy: Is it reasonable to expect a difference in inference speed between local and online environments, or could there be an underlying issue with my local setup?

  2. Impact of @torch.no_grad(): Does replacing @torch.inference_mode() with @torch.no_grad() have any significant impact on the inference speed? Could it be a contributing factor to the observed slowdown?

  3. Hugging Face Models: Would using the Hugging Face version of the model result in faster inference speeds compared to the locally configured llama2-70B-chat model?

  4. Optimizations for Local Inference: Are there any specific optimizations or configurations, such as flash attention, that could be applied to improve the local inference speed? (See the sketch after this list.)
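
As a sketch of one such option, and assuming a recent PyTorch build, torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention kernel on A100s. The tensor shapes below are illustrative, not taken from the 70B configuration:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim); fp16 on a CUDA device
# is required for the flash kernel.
q = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.float16)

# Force the FlashAttention backend (raises if it is unsupported on this setup).
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```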

I appreciate your assistance in addressing these concerns and would be grateful for any guidance or recommendations to optimize the local performance of the llama2-70B-chat model.

Thank you for your time and attention to this matter.

Best regards,
BAI Fan

@subramen
Contributor

The cloud service is likely running a bunch of optimizations to speed up inference, especially quantization. You might wanna check out https://github.com/pytorch-labs/gpt-fast which showcases many such optimizations to speed up generation (benchmarks in the README).
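
For a rough idea of what weight-only int8 quantization buys you (the helper names below are illustrative, not gpt-fast's actual API):

```python
import torch

def quantize_int8(weight: torch.Tensor):
    # Per-output-channel symmetric int8 quantization of a weight matrix.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    # Dequantize on the fly; fused kernels avoid materializing the fp weights,
    # which is where the speedup comes from for memory-bound 70B decoding.
    return x @ (q.to(x.dtype) * scale).t()

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
x = torch.randn(1, 4096)
print((x @ w.t() - int8_linear(x, q, s)).abs().max())   # small quantization error
```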

subramen self-assigned this Dec 13, 2023
@jeffxtang
Contributor

@baifanxxx You may also try vLLM or TGI as shown in this tutorial.
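
A minimal vLLM sketch for comparison (the Hugging Face model id and tensor_parallel_size=8 below are assumptions for an 8×A100 node):

```python
from vllm import LLM, SamplingParams

# Shard the 70B chat model across 8 GPUs and generate with continuous batching.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```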

subramen closed this as completed Jan 3, 2024
subramen added the performance and model-usage labels Jan 3, 2024