Speed Issues with Local Inference of llama2-70B-chat Model #957
Labels: model-usage (issues related to how models are used/loaded), performance (Runtime / memory / accuracy performance issues)
Hi there,
I hope this message finds you well. I am writing to report a performance issue I encountered while running the llama2-70B-chat model locally on an 8×A100 (80GB) machine. After downloading and configuring the model with the provided download.sh script, I attempted to run the example_chat_completion.py script with the following command:
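My launch followed the multi-GPU invocation documented in the repo README (flags approximate from memory; the 70B checkpoints require model parallelism of 8, hence `--nproc_per_node 8`):

```bash
torchrun --nproc_per_node 8 example_chat_completion.py \
    --ckpt_dir llama-2-70b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```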
However, I encountered a RuntimeError about an in-place update to an inference tensor outside of InferenceMode. Following the advice given in this GitHub issue, I replaced `@torch.inference_mode()` with `@torch.no_grad()` in model.py and generation.py. This resolved the initial error, allowing the model to run locally. Nevertheless, I noticed a significant discrepancy in inference speed between my local environment and the online version available at llama2.ai: locally, the model takes approximately 5 minutes per completion, while the online version responds almost in real time.
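For context, here is a minimal standalone repro of the error that motivated the swap (not the model code itself): tensors created under `inference_mode` cannot be mutated in place afterwards, whereas results produced under `no_grad` can.

```python
import torch

x = torch.ones(3)

# Tensors created inside inference_mode are "inference tensors":
# any later in-place update outside that mode raises the RuntimeError I hit.
with torch.inference_mode():
    y = x * 2
try:
    y.add_(1)
except RuntimeError as e:
    print(e)  # "Inplace update to inference tensor outside InferenceMode ..."

# Under no_grad the result is an ordinary tensor, so the same update succeeds,
# which is why the decorator swap makes the example script run.
with torch.no_grad():
    z = x * 2
z.add_(1)
print(z)  # tensor([3., 3., 3.])
```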
I have a few questions and concerns:
1. **Performance discrepancy:** Is it reasonable to expect a difference in inference speed between local and online environments, or could there be an underlying issue with my local setup?
2. **Impact of `@torch.no_grad()`:** Does replacing `@torch.inference_mode()` with `@torch.no_grad()` have any significant impact on inference speed? Could it be a contributing factor to the observed slowdown?
3. **Hugging Face models:** Would using the Hugging Face version of the model give faster inference than the locally configured llama2-70B-chat model? (See the sketch after this list for what I have in mind.)
4. **Optimizations for local inference:** Are there specific optimizations or configurations, such as flash attention, that could improve local inference speed?
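For questions 3 and 4, this is roughly the Hugging Face loading path I am considering (a hypothetical sketch on my part; the exact arguments depend on the installed transformers version, and the model ID assumes gated access has been granted):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated repo; requires approved access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the 70B weights on 8x A100 80GB
    device_map="auto",          # shard layers across all visible GPUs
    # attn_implementation="flash_attention_2",  # optional; needs the flash-attn package
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```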
I appreciate your assistance in addressing these concerns and would be grateful for any guidance or recommendations to optimize the local performance of the llama2-70B-chat model.
Thank you for your time and attention to this matter.
Best regards,
BAI Fan