Hardware requirements for Llama 2 #425
Using https://github.com/ggerganov/llama.cpp (without BLAS) for inference and quantization, I ran an INT4 version of 7B on CPU and it required 3.6 GB of RAM. I also ran 13B quantised to 3 bits per parameter (https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/blob/main/llama-2-13b-chat.ggmlv3.q3_K_S.bin) on a Pentium(R) Dual-Core CPU E5400 @ 2.70GHz (bogomips=5400.11, address sizes: 36 bits physical, 48 bits virtual). On a newer computer (Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz), I ran 13B quantised to INT8 (https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/blob/main/llama-2-13b-chat.ggmlv3.q8_0.bin). |
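For anyone who wants to reproduce this kind of CPU-only run from Python instead of the raw llama.cpp binary, here is a minimal sketch using the llama-cpp-python bindings; the model path, thread count and prompt are assumptions, not the exact setup reported above.

```python
# Minimal CPU-only sketch using the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path, thread count and prompt
# below are assumptions for illustration, not the exact setup reported above.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.ggmlv3.q3_K_S.bin",  # hypothetical local path
    n_ctx=2048,    # context window
    n_threads=2,   # match your physical core count (e.g. 2 on the Pentium E5400)
)

out = llm(
    "Q: Roughly how much RAM does a 4-bit 7B model need? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```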
I ran an unmodified llama-2-7b-chat on a 2x E5-2690v2 server. It loaded in 15.68 seconds and used about 15GB of VRAM and 14GB of system memory (above the idle usage of 7.3GB). |
How about the heat generation during continuous usage? |
I have it in a rack in my basement, so I don't really notice much. I've used this server for much heavier workloads and it's not bad. The GPU is only 140W at full load. This is only really using 1-2 CPU cores. This server puts out a lot more heat with high CPU loads. |
Thanks for info! |
Thanks for the useful information! |
(Last update: 2023-08-12, added NVIDIA GeForce RTX 3060 Ti) Using llama.cpp with llama-2-13b-chat.ggmlv3.q4_0.bin, llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin from TheBloke, on the following machines:
MacBook Pro (6-Core Intel Core i7 @ 2.60GHz, 16 GB RAM)
Gaming Laptop (12-Core Intel Core i5 @ 2.70GHz, 16GB RAM, GeForce RTX 3050 mobile 4GB)
Cloud Server (4-Core Intel Xeon Skylake @ 2.40GHz, 12GB RAM, NVIDIA GeForce RTX 3060 Ti 8GB)
Cloud Server with 2x GPUs (8-Core Intel Xeon Skylake @ 2.40GHz, 24GB RAM, 2x NVIDIA GeForce RTX 3080 10GB)
Cloud Server (4-Core Intel Xeon Skylake @ 2.40GHz, 24GB RAM, NVIDIA GeForce RTX 3090 24GB)
Cloud Server (24-Core Intel Xeon CPU E5-2650 v4 @ 2.20GHz, 96GB RAM, NVIDIA GeForce A40 48GB)
Cloud Server (4-Core Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 16GB RAM, NVIDIA GeForce RTX A4000 16GB)
Cloud Server (8-Core AMD Ryzen Threadripper 3960X @ 2.20GHz, 32GB RAM, NVIDIA GeForce RTX A6000 48GB)
Google Colab (2-Core Intel Xeon CPU @ 2.20GHz, 13GB RAM, NVIDIA Tesla T4 16GB)
(Note: the cloud servers are sometimes not reliable; their results depend on other users on the same host.) |
Using llama.cpp, the best result so far is just over 8 tokens/s. |
Thanks a lot @longyee! |
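As a rough way to sanity-check which of the machines listed above can hold which quantized file, here is a back-of-the-envelope sizing sketch; the effective bits-per-weight figures and the 1.1x overhead factor are assumptions, not measurements from this thread.

```python
# Back-of-the-envelope RAM/VRAM estimate for quantized Llama 2 weights.
# Assumptions: ~4.5 and ~8.5 effective bits per weight for q4_0 and q8_0
# (4/8-bit values plus per-block scales), and a 1.1x factor for KV cache
# and runtime buffers. Real usage varies with context length and backend.
def estimate_gib(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30

for name, params, bits in [("13B q4_0", 13, 4.5),
                           ("13B q8_0", 13, 8.5),
                           ("70B q4_0", 70, 4.5)]:
    print(f"{name}: ~{estimate_gib(params, bits):.1f} GiB")
```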
Llama2 7B-Chat on RTX 2070S with bitsandbytes FP4, Ryzen 5 3600, 32GB RAM. Completely loaded in VRAM at ~6300MB; took ~12 seconds to process ~2200 tokens and generate a summary (~30 tokens/sec). Also ran the same on an A10 (24GB VRAM) / LambdaLabs VM with similar results.
|
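For reference, loading Llama 2 with bitsandbytes FP4 as described above is commonly done through transformers' BitsAndBytesConfig; a minimal sketch, where the model ID, prompt and generation settings are assumptions rather than the exact setup used above:

```python
# Minimal FP4 loading sketch with transformers + bitsandbytes + accelerate.
# Assumptions: a CUDA GPU with roughly 6GB+ of free VRAM and access to a
# Llama 2 chat checkpoint; the model ID and prompt are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",             # FP4, as in the comment above
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # place the quantized weights on the GPU
)

inputs = tokenizer("Summarize the hardware requirements for Llama 2:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```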
Ran llama2-7b-chat quantized to 4-bit on CPU via llama.cpp on a MacBook Pro (M1 Pro, 32 GB RAM).
|
Ran llama2-70b-chat with llama.cpp, ggmlv3 quantized to 6-bit from TheBloke, on CPU (dual Xeon E5-2690v2). Consumed roughly 55GB of RAM.
|
M1 MacBook Pro (16GB RAM, 10-core) with Llama2 using the Replicate + Gerganov bash script. 2GB RAM, 17 tokens per second, 8 threads.
|
Using https://github.com/ggerganov/llama.cpp (with CUBLAS) for inference and quantization, I ran an INT4 version of 13B on 4x GTX 1060 at 7.5 tokens per second. In order to make it work, I had to load fewer layers onto the main device with the
|
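A similar uneven split can also be expressed from Python; below is a sketch assuming a llama-cpp-python build compiled with cuBLAS that exposes the main_gpu / tensor_split options (the model path and split ratios are illustrative, not the ones used above):

```python
# Multi-GPU sketch with llama-cpp-python built against cuBLAS.
# Assumptions: four visible CUDA devices and a build recent enough to expose
# main_gpu/tensor_split; the model path and split ratios are illustrative only.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.ggmlv3.q4_0.bin",  # hypothetical local path
    n_gpu_layers=100,                # more layers than the model has = offload everything
    main_gpu=0,                      # device that also holds scratch buffers
    tensor_split=[0.16, 0.28, 0.28, 0.28],  # give the main device a smaller share
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```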
Gaming Laptop (8-Core Ryzen 9 7940HS @ 5.20GHz, 32GB RAM, GeForce RTX 4080 mobile 12GB)
Power Saving:
|
I ran: TheBloke_Llama-2-7b-chat-fp16. CPU: Core™ i9-13900K Loaded in 12.68 seconds, used about 14GB of VRAM. |
How many tokens per second did you get? |
I'd also like to ask whether the hardware above, sized for a single Q&A session, can meet the needs of multiple concurrent chat sessions, or whether a load balancer and a queue should be added for concurrent processing. |
Can you please explain it in more detail, like how you offloaded all layers? |
Can this be scaled across multiple cards with something like k8s to abstract multiple GPUs? |
Just FYI for anybody looking at the non-quantized default, during inference on 8x A100 40GB SXM:
|
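For anyone who wants to try the non-quantized checkpoints across several GPUs without Meta's reference launcher, one common route is to let transformers shard the fp16 weights automatically; a minimal sketch, where the model ID is an assumption and this is not necessarily the setup used in the comment above:

```python
# Sharded fp16 inference sketch with transformers + accelerate.
# Assumptions: several CUDA GPUs visible, enough combined VRAM for the fp16
# weights (~2 bytes per parameter), and access to the checkpoint; the model ID
# is illustrative and this is not necessarily the setup used above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",   # accelerate spreads layers across all visible GPUs
)

inputs = tokenizer("What hardware do I need to run Llama 2?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```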
Can I run this model on my PC with 4GB of RAM? |
Similar to #79, but for Llama 2. Post your hardware setup and what model you managed to run on it.