Cuda inference doesn't work anymore! #812
Comments
Here it works - can you post your model config file, and logs with debug enabled?
So this is currently my docker-compose file:
How can I enable debug? I added the --debug thing at the end but that doesn't seem to do anything... I'm essentially using the guanaco.yaml from the model gallery and overriding it so it has additional options set.
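For reference, an override of that shape typically looks roughly like this — a minimal sketch, not the exact file, with illustrative values for the GPU-related options:

```yaml
# models/guanaco.yaml — illustrative sketch of a model config with GPU options
name: guanaco
backend: llama
parameters:
  model: guanaco-33B.ggmlv3.q4_0.bin
  temperature: 0.7
context_size: 2048
threads: 10
f16: true        # use 16-bit floats on the GPU
gpu_layers: 60   # number of layers to offload to the GPU
```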
I added those options recently, thinking they might somehow enable the usage of the GPU. I also had `batch` set before, but in config.go I couldn't find anything for batch, so I changed it to context_size, thinking maybe that could fix my problem... This is the config.go section I'm talking about:
Weird, in the image currently active in my docker-compose file, when I try to run this command:
I now get another error:
However, the API call via Postman still gives me an answer. But it's slow, since it's running on the CPU only... Also, when I make the API call I still get the rpc error and nothing else in the LocalAI logs:
To enable debug:
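A minimal sketch of what that looks like in a compose file, assuming the `DEBUG` environment variable (which LocalAI reads to turn on verbose logging):

```yaml
# sketch: enable verbose/debug logging for the LocalAI container
services:
  api:
    environment:
      - DEBUG=true
```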
Okay I did that, this comes when I make the API call:
And this is my cURL:
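Roughly of this shape — the model name and prompt below are placeholders, not the original request:

```bash
# sketch: a chat completion request against LocalAI's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "guanaco",
        "messages": [{"role": "user", "content": "How are you?"}],
        "temperature": 0.7
      }'
```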
@emakkus can you try with images from master and see if you can reproduce there? e.g. quay.io/go-skynet/local-ai:master-cublas-cuda11
@mudler I tried the image you recommended. That one actually seems to be able to utilize the GPU; however, it then fails with a segmentation fault right when inference starts. Here are the logs:
Are there other images I could try? Or do you see the problem with the logs I provided?
I think I am able to replicate the issue with a fresh VM in GCP, a G2-standard-4 instance with 1x NVIDIA L4. The OS is common-gpu-debian-11-py310.
Output of trying to execute a model using GPU acceleration:
I did try it with REBUILD=true now, but the results are exactly the same as before:
I also tried to run the go-llama example, and there I also get the Segfault:
So I suspected that maybe llama.cpp itself was somehow broken, but it seems to be fine:
All these commands are from within the same container with REBUILD=true. I also tried to run the Dockerfile build on the master branch, but I get the exact same results. At least it's trying to use the GPU now... but I don't get why the segfault would happen. Am I missing a param that it expects or something? This is my PRELOAD_MODELS value:
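PRELOAD_MODELS takes a JSON array of gallery entries, roughly of this shape (the URL and name below are placeholders, not the exact value):

```yaml
# docker-compose excerpt — sketch of the PRELOAD_MODELS format (placeholder entry)
environment:
  - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/guanaco.yaml", "name": "guanaco"}]'
```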
And this is my current full docker-compose.yaml:
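It follows the usual pattern for LocalAI with NVIDIA GPU passthrough — the sketch below is illustrative (image tag, ports, and paths are assumptions), not the file verbatim:

```yaml
# sketch: LocalAI compose service with NVIDIA GPU access (values are illustrative)
version: "3.9"
services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda11
    ports:
      - "8080:8080"
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - REBUILD=true
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```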
I changed gpu_layers back to 50, thinking the VRAM might have been getting full or something along those lines, but now that definitely shouldn't be the case; it's only taking about 15GB of 24GB.
I now tried to use an older release image, and there everything works:
So something must have happened since then that somehow leads to the segfaults... my model configuration is the same as before. I will try out 1.22.0, but my hopes are kinda low on that one...
Weird. I could finally reproduce in another box - I'll try to have a look at it later today.
That was a local build with just cublas enabled. I had pulled the latest from git. I will try later with a different version tag/label to see if I can get it working there.
I tried out 1.22.0. It sees the GPU, but doesn't use it. Even if I run go-llama directly inside the container, the exact same thing happens. Even setting -ngl doesn't change the behaviour.
However what works (as it does in every version I tried so far) is to directly run the naked llama.cpp:
So version 1.21.0 worked flawlessly with CUDA, version 1.22.0 sees the GPU but refuses to use it. It doesn't segfault or error out, but it only uses the CPU. And the master branch version tries to use the GPU, but segfaults... Normally I wouldn't mind using 1.21.0... but because of Llama2 and so on... I kinda want to stay up-to-date.
Fixes: #812 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler I love you man, now it works! I have built your update_rope branch and it works! <3
I'm actually having the same issue that emakkus had at the beginning. I've tried v1.23.1, v1.23.0, v1.22, and v1.21; none of them have worked. My nvidia-smi works within the repo, I just don't see any processes running or found. Specifically, I've been using this with the Obsidian plugin for LocalAI, but even when I run prompts directly through the terminal it doesn't work. Here is my nvidia-smi:
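For completeness, it was run from inside the container, along these lines (the container name is a placeholder):

```bash
# sketch: verify the GPU is visible from inside the LocalAI container
docker exec -it localai nvidia-smi
```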
Here is that same weird error that emakkus got last week in the logs:
The API call works, but for some reason it's just never using the GPU. I thought I was going crazy until I found this thread. Here's the docker-compose.yaml file:
LocalAI version:
quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg@sha256:f868a3348ca3747843542eeb1391003def43c92e3fafa8d073af9098a41a7edd
I also tried to build the image myself; exact same behaviour.
Environment, CPU architecture, OS, and Version:
Linux lxdocker 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-5 (2023-07-14T17:53Z) x86_64 GNU/Linux
It's a Proxmox LXC with Docker running inside it. CUDA inference did work in an earlier version of this project, and llama.cpp still does work.
Describe the bug
No matter how I configure the model, I can't get inference to run on the GPU. The GPU is being recognized; however, its VRAM usage stays at 274MiB / 24576MiB.
nvidia-smi does work inside the container.
When starting the container, the following message appears:
and when I make the completion call, only the CPU seems to take the load and it responds slowly (instead of using the GPU and being fast).
Also I somehow ALWAYS get the following message in the logs when I make the API call:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41227: connect: connection refused"
However, the API call still works. I just can't see what the backend is doing.
If I attach to the container and go into the go-llama directory and make the test call from there:
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda-12.2/lib64" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10
I get the following output:
As you can see, it is able to find the GPU, but it won't use it. When I write anything to it, only the CPU is used.
In ./examples/main.go I could find the ngl parameter for GPU layers; I used it with 60 and 70 and it didn't help. Same behaviour!
Finally I ran this:
go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10 -ngl 70 -n 1024
I removed all the prefix stuff, and I get the exact same behaviour with the exact same output as above. It is as if go-llama somehow doesn't make use of the GPU anymore.
However the interesting part is this:
If I make this call:
root@lxdocker:/build/go-llama/build/bin# ./main -m /models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 1024 -ngl 70
(I copied the whole bash line, so the path can be seen too.) It works! The GPU is being used and it's super fast, as expected!
Here the model output from llama.cpp:
As you can see, llama.cpp is able to use the GPU... but LocalAI somehow isn't. I've been trying to figure the problem out for several days now, but I just can't... Sadly I can't code in Go, so I don't really understand what's going on either... And the gRPC stuff also seems to throw errors but somehow still works...
I hope someone with more knowledge of how the whole backend is set up can maybe help out... I tried to gather as much information as I can.
I would really love to be able to use this project!
To Reproduce
Simply try to make an inference via CUDA...
Expected behavior
The GPU should be used, just as naked llama.cpp inside the image itself is able to do.