LLama2 failing to load in Docker with cuBLAS #1109
Comments
localai-api-1 | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr create_gpt_params: loading model /models/llama2-13b-chat.gguf

Segfault in the go-llamacpp code, I believe. Correct me if I am wrong, but the go-llamacpp this project uses at the moment is quite old, several months behind the latest llamacpp. GGUF is not supported, I think; only ggml (the deprecated format) is.

After deleting the existing container and image, redownloading and rebuilding, the model is properly being loaded into VRAM. However, the inference speed is extremely slow; I killed the process after several minutes of waiting for a response. I will do more testing later to see exactly how long it takes and report back here.
you have the GPU layers set to |
@lunamidori5 Is there any documentation you can recommend to better understand the relationship between the possible number of GPU layers and a GPU model? |
After some testing, I was able to determine the highest possible value I could set for the GPU layers. However, even when safely maximizing VRAM usage, inference takes over 1 hour and 30 minutes. Running the same prompt directly with llama.cpp, inference takes only 3 minutes. Please see below for more information:

curated logs with
error html response with
With
However, as seen below, it took over an hour and a half to run inference on a simple prompt:

For comparison, running the same prompt directly with llama.cpp, inference takes about 3 minutes:
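How many layers fit depends on the model's size and quantization and on the card's VRAM, so there is no single right value. A rough, hedged rule of thumb is sketched below as comments on the relevant LocalAI model YAML field; the number shown is just the value settled on later in this thread, not a recommendation:

```yaml
# Hedged heuristic, not an official formula:
#   per-layer VRAM      ≈ GGUF file size / total layer count
#   offloadable layers  ≈ (free VRAM - headroom for the KV cache) / per-layer VRAM
# Watch nvidia-smi while raising the value and back off before the card fills up.
gpu_layers: 9
```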
Sadly no, I do not, as every model is its own, but I am working on updating the how-tos to make that even clearer! @RussellPacheco
@RussellPacheco in your yaml file add
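The exact snippet suggested above was not captured in this thread. Purely as a hedged illustration, GPU-related tuning in a LocalAI model YAML usually touches fields like the following (field names follow LocalAI's model config conventions; the values are placeholders, not the settings actually recommended here):

```yaml
name: llama2-13b-chat
parameters:
  model: llama2-13b-chat.gguf
context_size: 4096
f16: true       # enable 16-bit compute where the backend supports it
gpu_layers: 9   # layers offloaded to the GPU (see the note above)
threads: 6      # worker threads; matching physical cores matters later in this thread
```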
I made the above changes and the inference speed did improve slightly; it now takes 50 minutes instead of the hour and a half it was taking before. However, this is still significantly slower than running llama.cpp directly, with its inference time of 3 minutes.
@RussellPacheco what is your gpu layers set to again? (I'll keep looking into what may be making this happen)
@lunamidori5 They are set to 9. Thank you for looking into this for me.
@RussellPacheco do you have a smaller model you are okay with trying, like a 7B or 3B?
@lunamidori5 I am preparing this now and will let you know the outcome.
Here are my results from attempting to run a 7B model. The conclusion was that even though the model fully loads into VRAM, there was no inference output even after waiting more than 2 hours. The same model can run inference directly with llama.cpp in about a minute.

llama2-7b.yaml file contents:
LocalAI curated logs:
nvidia-smi output (shows the model getting fully loaded into VRAM):
timestamp:
llama.cpp output and timestamp:
@RussellPacheco Alright, I am deeply sorry about that. Let me do some more looking; the only thing I can think of is something to do with Docker or the image.
@RussellPacheco you are using 20 threads, but your CPU only has 6 cores. This can slow LocalAI down. You should set threads to 5 or 6, maybe 7 at most.
@Dbone29 Yup, that was the problem. An oversight on my part. I set it to an appropriate number based on the number of cores and threads per core my CPU has, and everything is working as expected. Inference only takes 2 minutes now. I will close this issue as everything is working.
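For anyone landing here with the same symptom, the fix was simply matching the thread count to the number of physical cores. A minimal sketch of the relevant model YAML line, assuming a 6-core CPU like the Ryzen 5 5600X in this report:

```yaml
# Oversubscribing threads (e.g. 20 on a 6-core part) makes the llama.cpp workers
# contend for cores and can slow inference down dramatically.
threads: 6   # physical core count; check "Core(s) per socket" in lscpu on the host
```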
LocalAI version:
local-ai:master-cublas-cuda12
Environment, CPU architecture, OS, and Version:
Docker Container Info:
Linux 60bfc24c5413 4.18.0-477.21.1.el8_8.x86_64 #1 SMP Thu Aug 10 13:51:50 EDT 2023 x86_64 GNU/Linux
Host Device Info:
Linux rig 4.18.0-477.21.1.el8_8.x86_64 #1 SMP Thu Aug 10 13:51:50 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
NAME="AlmaLinux"
VERSION="8.8 (Sapphire Caracal)"
GPU: NVIDIA GeForce RTX 2080
CUDA Version: 12.2
Architecture: x86_64
Model name: AMD Ryzen 5 5600X 6-Core Processor
Describe the bug
Trying to run llama2-13b-chat.gguf in a Docker container built with cuBLAS support ends with errors, and the model is not even loaded into VRAM. See below for related files and debug log output.
To Reproduce
docker-compose.yaml
.env file contents
llama2-13b-chat.yaml file contents:
llama2-13b-chat.tmpl file contents:
Curl Command:
Expected behavior
Llama2.gguf model should successfully be loaded into VRAM and provide a response to the prompt.
Logs
Debug Log Output:
Additional context