Llama2 Slow Response Time on RTX 3070 #3352

sleepingforest1024 · 2023-07-28T17:39:13Z

Hello everyone,

I've been running the Llama2 model on an NVIDIA RTX 3070 and am experiencing longer than expected response times. For a simple greeting like "hello", it takes around 26 seconds to respond. For a longer sentence, it took up to 70 seconds.

During these computations, I've noticed that my GPU and CPU usage is relatively low (around 25% and 12%, respectively). This suggests to me that the workload isn't fully utilizing my hardware.

Given this, I have a few questions:

1. Are there any known issues or bottlenecks that might be causing these long response times with Llama2? This could be related to data loading or transfer times, or perhaps some other hardware or software limit.
2. Is there any way to optimize Llama2's performance on my current hardware (RTX 3070)? Are there any settings or configurations I can adjust to improve utilization and reduce response time?
3. If I were to upgrade to a more powerful GPU, like the RTX 4090, could I expect a significant reduction in response times? Or are there other factors at play here that a more powerful GPU wouldn't necessarily address?

I appreciate any insights you can provide!

Best,
Cyril

berkut1 · 2023-07-28T18:16:23Z

The Nvidia driver has had an issue for the past 3 months. Make sure you are using Nvidia driver version 531 or lower.

0 replies

sleepingforest1024 (Author) · 2023-07-29T10:27:11Z

The Nvidia driver has had an issue for the past 3 months. Make sure you are using Nvidia driver version 531 or lower.

Hello berkut1,

Thanks for your input regarding the Nvidia driver version. I have followed your advice and tried versions 531.41 and 528.49, replacing the previously installed 536.67 version. Unfortunately, I'm still experiencing slow response times from Llama2. During these operations, I've noticed that the GPU's dedicated memory usage is maxed out (8GB on my RTX 3070).

One additional piece of information that might be relevant is that I'm running Llama2 on a virtual machine via Proxmox. The VM is provisioned with 64GB of RAM and 12 CPU cores, and I'm using direct GPU passthrough for the RTX 3070. I am beginning to wonder if the virtualized environment could be contributing to the slow response times.

To further investigate, I am planning to try running Llama2 on a bare metal machine with both the RTX 3070 and an RTX 3080 to see if this improves performance. I'll keep you updated with any findings.

Thank you again for your assistance so far, and any further insights you may have would be appreciated.
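
A minimal sketch for checking whether dedicated VRAM is close to full while the model is generating. It assumes `nvidia-smi` is on the PATH; the helper names and the 0.95 "nearly full" threshold are hypothetical choices for illustration, not something from this thread.

```python
import subprocess

# Standard nvidia-smi query fields; with "nounits" the values are plain numbers.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line):
    """Parse one CSV line such as '7900, 8192, 25' (MiB, MiB, percent)."""
    used, total, util = (float(field) for field in line.split(","))
    return {
        "used_mib": used,
        "total_mib": total,
        "util_pct": util,
        "mem_fraction": used / total if total else 0.0,
    }


def read_gpus():
    """Query nvidia-smi once; return an empty list if it is missing or fails."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
    except (OSError, subprocess.CalledProcessError, ValueError):
        return []


if __name__ == "__main__":
    for index, gpu in enumerate(read_gpus()):
        # Hypothetical threshold: treat more than 95% used as "nearly full".
        status = "nearly full" if gpu["mem_fraction"] > 0.95 else "has headroom"
        print(f"GPU {index}: {gpu['used_mib']:.0f}/{gpu['total_mib']:.0f} MiB, "
              f"{gpu['util_pct']:.0f}% util, {status}")
```

Running this in a loop during generation would show whether low GPU utilization coincides with memory sitting at the card's limit.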

3 replies

bombel28 Aug 1, 2023

I switched from an RTX 3080 Ti to a 4090, and the speedup is incredible. But please note that there are two important factors.
What model size are you using? With 8GB of VRAM, a 7B model may get slow when the context gets too large. The same happens on a 4090 when running inference with a 33B model at an 8k context size with over 4K of chat history. That is the moment your VRAM fills up. At this breakpoint, everything gets slow, with any hardware and any model.
The best balance at the moment is to use 4-bit models, such as AutoGPTQ with ExLlama or 4-bit GGML, with a group size of 128. Choose a model size that easily fits in your VRAM. For only a few kilobytes of chat history, you may need gigabytes of additional VRAM! Use the "truncate prompt to" parameter and try lowering its value. Read the "low vram guide" readme for more information.
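
To make "choose a model size which easily fits your VRAM" concrete, here is a rough back-of-the-envelope sketch. It assumes nominal parameter counts (7e9, 33e9) and ignores the scales and zero-points that quantized formats store; the `headroom_gib` value reserved for context and activations is a hypothetical placeholder, not a measured figure.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits(n_params, bits_per_weight, vram_gib, headroom_gib=1.5):
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return weight_gib(n_params, bits_per_weight) + headroom_gib <= vram_gib


if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("33B", 33e9)]:
        for bits in (16, 8, 4):
            size = weight_gib(n_params, bits)
            print(f"{name} at {bits}-bit: {size:.2f} GiB of weights, "
                  f"fits in 8 GiB: {fits(n_params, bits, 8)}, "
                  f"fits in 24 GiB: {fits(n_params, bits, 24)}")
```

Under these assumptions a 7B model needs roughly 13 GiB at 16-bit but only about 3.3 GiB at 4-bit, which is why the 4-bit variants are the ones that leave room for context on an 8GB card.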

tunichgud Aug 7, 2023

I also use an Nvidia RTX 3070, and I also experience extremely bad performance (0.4 tokens per second or worse). I am not running it virtualized; it runs under Windows 11 (with 64 GB of RAM and 8 AMD CPU cores).

realhaik Aug 18, 2023

8GB of VRAM will never work.
I have a 4090 with 24GB and it barely works with llama-2-7b-chat.
With max_batch_size set to 1, the model starts at 15GB.
With max_batch_size set to 3, the model starts at 19GB and expands to 22GB during use.
With max_batch_size set to 4 or higher, it occupies more than 24GB immediately at startup, which makes memory spill into Shared Graphics Memory; after that, the response times resemble what you have posted.

In other words, the spill into Shared Graphics Memory is what is causing your slow response times. And there is no workaround with 8GB of graphics memory.


temperature = 0.1
top_p = 0.9
max_seq_len = 4000
max_batch_size = 3
max_gen_len = None
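
A rough sketch of why memory grows with max_batch_size and max_seq_len. It assumes the key/value cache is preallocated for max_batch_size × max_seq_len positions in 16-bit precision, and uses the Llama 2 7B shape (32 layers, hidden size 4096). The function name is hypothetical, and the results are estimates of the cache only, so they will not match the totals reported above exactly.

```python
def kv_cache_gib(max_batch_size, max_seq_len, n_layers=32, dim=4096, bytes_per_value=2):
    """Estimate the size in GiB of a preallocated key/value cache."""
    # Two tensors (keys and values) per layer, each batch x seq_len x dim values.
    n_values = 2 * n_layers * max_batch_size * max_seq_len * dim
    return n_values * bytes_per_value / 2**30


if __name__ == "__main__":
    for batch in (1, 3, 4):
        print(f"max_batch_size={batch}, max_seq_len=4000: "
              f"{kv_cache_gib(batch, 4000):.2f} GiB of cache")
```

The cache scales linearly with both settings, so under these assumptions each extra batch slot at max_seq_len = 4000 costs about 2 GiB on top of the weights. Lowering either value is the cheapest way to get memory back.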

tunichgud · 2023-08-07T15:35:24Z

I just played around with the options in the model configuration. After activating "load in 4 bit", the generation speed went up to 10 tokens per second.
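
For readers wondering what 4-bit loading does to the weights, here is a simplified illustration of group-wise 4-bit quantization. I don't know which scheme the "load in 4 bit" option uses internally; this sketch is a plain absmax quantizer with hypothetical function names, meant only to show the general idea of storing small integer codes plus one scale per group.

```python
def quantize_4bit(weights, group_size=4):
    """Quantize floats to integer codes in -7..7 with one absmax scale per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Fall back to a scale of 1.0 for an all-zero group to avoid dividing by zero.
        scale = max(abs(w) for w in group) / 7 or 1.0
        codes = [round(w / scale) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize_4bit(groups):
    """Reconstruct approximate floats from (scale, codes) pairs."""
    return [code * scale for scale, codes in groups for code in codes]


if __name__ == "__main__":
    original = [0.5, -1.2, 0.03, 2.1, -0.7, 0.0, 0.9, -2.8]
    restored = dequantize_4bit(quantize_4bit(original))
    for before, after in zip(original, restored):
        print(f"{before:+.3f} -> {after:+.3f}")
```

Each weight shrinks from 16 bits to 4 bits plus a shared scale, at the cost of a small rounding error. The much smaller footprint is presumably what lets the model stay inside 8GB of VRAM.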

0 replies
