Llama2 Slow Response Time on RTX 3070 #3352
Replies: 3 comments 3 replies
The Nvidia driver has had an issue for the past 3 months. Make sure you are using Nvidia driver version 531 or lower.
Hello berkut1,
Thanks for your input regarding the Nvidia driver version. I followed your advice and tried versions 531.41 and 528.49, replacing the previously installed 536.67. Unfortunately, I'm still experiencing slow response times from Llama2. During these operations, I've noticed that the GPU's dedicated memory usage is maxed out (8 GB on my RTX 3070).
One additional piece of information that might be relevant: I'm running Llama2 on a virtual machine via Proxmox. The VM is provisioned with 64 GB of RAM and 12 CPU cores, and I'm using direct GPU passthrough for the RTX 3070. I am beginning to wonder whether the virtualized environment could be contributing to the slow response times.
To investigate further, I am planning to try running Llama2 on a bare metal machine with both the RTX 3070 and an RTX 3080 to see if this improves performance. I'll keep you updated with any findings.
Thank you again for your assistance so far. Any further insights you may have would be appreciated.
I just played around with the options in the model configuration. After activating "load in 4 bit", generation speed went up to 10 tokens per second.
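For anyone wondering why 4-bit loading could make such a difference on an 8 GB card, here is a back-of-the-envelope sketch. It only estimates the size of the model weights at different precisions. The thread does not say which Llama 2 size is in use, so the 7B and 13B parameter counts, the 1 GiB overhead allowance, and the helper names `weight_memory_gib` and `fits_in_vram` are all assumptions for illustration, not measurements from this setup.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumptions (not from the thread): the model is a Llama 2 variant with
# roughly 7e9 or 13e9 parameters, and about 1 GiB of extra VRAM is needed
# for activations, KV cache and CUDA context. Real usage will differ.

GIB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Size of the weights in GiB at the given precision."""
    return n_params * bits_per_param / 8 / GIB


def fits_in_vram(n_params: float, bits_per_param: int,
                 vram_gib: float, overhead_gib: float = 1.0) -> bool:
    """True if weights plus the assumed overhead fit in the given VRAM."""
    return weight_memory_gib(n_params, bits_per_param) + overhead_gib <= vram_gib


if __name__ == "__main__":
    vram = 8.0  # RTX 3070, as mentioned earlier in the thread
    for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
        for bits in (16, 8, 4):
            size = weight_memory_gib(n_params, bits)
            verdict = "fits" if fits_in_vram(n_params, bits, vram) else "does not fit"
            print(f"Llama 2 {name} @ {bits}-bit: {size:5.1f} GiB -> {verdict} in {vram:.0f} GiB")
```

Under these assumptions, a 7B model in 16-bit needs roughly 13 GiB for weights alone, which would not fit in 8 GB of dedicated memory, while the same model in 4-bit needs a little over 3 GiB. That would be consistent with the maxed-out VRAM reported above, though it is only an estimate.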
Hello everyone,
I've been running the Llama2 model on an NVIDIA RTX 3070 and am experiencing longer than expected response times. A simple greeting like "hello" takes around 26 seconds to get a response, and a longer sentence took up to 70 seconds.
During these computations, I've noticed that my GPU and CPU usage is relatively low (around 25% and 12%, respectively). This suggests to me that the workload isn't fully utilizing my hardware.
Given this, I have a few questions:
Are there any known issues or bottlenecks that might be causing these long response times with Llama2? This could be related to data loading or transfer times, or perhaps some other hardware or software limit.
Is there any potential way to optimize Llama2's performance on my current hardware (RTX 3070)? Are there any settings or configurations I can adjust to improve utilization and reduce response time?
If I were to upgrade to a more powerful GPU, like the RTX 4090, can I expect a significant reduction in response times? Or are there other factors at play here that a more powerful GPU wouldn't necessarily address?
I appreciate any insights you can provide!
Best,
Cyril