Unable to get Ollama to utilize GPU on Jetson Orin Nano 8Gb #1979
@remy415 any solution? I am observing the same on an AGX Orin. Seems like a bug in ollama. `~$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ollama serve`
@Q-point Nothing yet. I haven't had time to troubleshoot. If I had to guess, there may have been an update in the way Jetpack presents its drivers. I'm not an expert in Linux drivers; it's just the only explanation that makes sense, given that @bnodnarb was able to get it working with little customization, and I doubt Ollama made any tweaks to something that was already working, so the only logical culprit is a change in the drivers. I do know that NVidia changed the way it exposes CUDA to containers: previously, containers would basically mount the host's installed drivers, whereas the containers released by dustynv now have the drivers baked into the container itself, and NVidia's expectation is a decoupling of host-system drivers from container-used drivers.
Is there any way to get `libnvidia-ml.so` installed on the system? What does the `nvidia-smi` output look like on these systems?
I've tried to get it installed, but as dustynv pointed out in another post somewhere (on my phone, will find it later), the Tegra line of SBCs running Jetpack use integrated GPUs and aren't compatible with NVML / libnvidia-ml.so / nvidia-smi. This is changing in Jetpack 6.0, but that isn't releasing until at least March.

I spent some time poking around in the ollama source code to see exactly what it needed from libnvidia-ml.so, but I had difficulty finding comparable syscalls on the Jetson because the system data tools I did find are just Python scripts that call Python CUDA libraries; I didn't dive too far down that rabbit hole.

Another thing is that the llama_cpp that works for the Jetson is a custom build done by dustynv that leverages the nvcc compiler. I tried injecting his llama_cpp container and prebuilt binary into the ollama dockerfile build, but it didn't work; I think there is something gpu_info passes to the make process that I haven't worked the kinks out of yet, and I still need to find what information the gpu_info.go routine requires from the CUDA API to ensure it's properly converted to the Jetson format.

Any insights you could provide on that front would be greatly appreciated.
That's unfortunate they didn't implement support for the management library. We've added a dependency on it to discover the available GPUs and their memory information so we can determine how much we can load into the GPU. We do have a mechanism now to force a specific llm library. A potential path for us to consider here is to refine the memory prediction logic so you can tell us how much memory to use via an env var and bypass the management library checks, then force the cuda llm library; that might be sufficient to get us working again on these systems.
There's support for the library, kind of. I've compiled my own .so file that essentially wraps the NVML functions and the parameters returned by their API calls. I'm not terribly great with C/C++, but I got it to compile with NVCC. TL;DR: I used cuda_runtime.h API calls to gather the same information returned by the NVML calls. It seems to have mostly worked, but I'm running into errors. I'll take another try at it tomorrow, but meanwhile I've uploaded the file changes if you're interested in taking a look: https://www.github.com/remy415/ollama_tegra_fix The error I received:
Seems like the GetGpuInfo() function (the memInfo call) failed. I'll need to take a look and make sure I implemented the API correctly and that the information is passed in the format the Go routine expects. I haven't had much time to troubleshoot today; this is just an initial draft of what I was thinking could be done to avoid leaving too much in the user's hands, assuming API usage is preferred over user-defined env variables for memory, etc.
@dhiltgen Thank you, I'll take a look at that. I'll update the repo and let you know here if I get anything working.
@dhiltgen I was able to use the CUDA Runtime API, compiled with NVCC, to grab an initialize, the device count, the memory info (memory max and memory used, with a diff to figure out the free memory), and the CUDA compute capability major and minor values. "GetHandleByIndex" was just assumed to be '0' -- there's no Runtime API call for it, but it's also not needed -- you simply query device properties for a device index (which can be done in a for loop using the device count). This will work on any CUDA device and doesn't require hooking into the NVML shared object. It only needs to be compiled with NVCC, but that's included in the standard CUDA toolkit. The source code for the sample binary I made should work on any Linux machine with CUDA drivers and a CUDA device. Any reason you wouldn't want to switch to using the CUDA Runtime API instead of querying NVML? I'm not super experienced with the quirks of the Runtime API vs the NVML API, but if all you're doing is gathering device info before loading llama_cpp, then leveraging the Runtime API instead of NVML would work great for a solution that is compatible with both Jetson and desktop CUDA. I think device properties cover most of what the rest of your typedefs were looking for too. If you have a Linux box with a CUDA device (I'll test it out on Windows at some point too, but I think it should work there since the device API calls seem to be system agnostic), please let me know if this would work as a suitable alternative to libnvidia-ml.so.
Files: https://raw.githubusercontent.com/remy415/ollama_tegra_fix/master/tegra-ml-test.cu
Compile command is
Command output:
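For anyone reading along, here is a minimal sketch of the kind of Runtime API probe described above. It is not the author's tegra-ml-test.cu, just an illustration using the standard cudart calls (cudaGetDeviceCount, cudaGetDeviceProperties, cudaSetDevice, cudaMemGetInfo) that return the same device count, compute capability, and memory figures NVML would provide:

```c
/* Illustrative sketch only (not the author's tegra-ml-test.cu):
 * gather the fields NVML would report, using only the CUDA Runtime API.
 * Build with nvcc from the standard CUDA toolkit. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        /* cudaMemGetInfo reports free/total for the *current* device,
         * so select the device before querying memory. */
        cudaSetDevice(i);
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);

        printf("device %d: %s\n", i, prop.name);
        printf("  compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  memory total: %zu MiB, used: %zu MiB, free: %zu MiB\n",
               total_b >> 20, (total_b - free_b) >> 20, free_b >> 20);
    }
    return 0;
}
```

Note that on Tegra boards the reported totals reflect the unified memory shared with the CPU rather than dedicated VRAM.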
This sort of approach could be viable. A key aspect of our GPU discovery logic is relying on dlopen/dlsym (and LoadLibrary/GetProcAddress on Windows) so that we can have a soft dependency on the underlying GPU libraries. This lets us fail gracefully at runtime and try multiple options before ultimately falling back to CPU mode if necessary. I believe this would translate into loading libcudart.so and wiring up these
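A rough sketch of that soft-loading pattern, assuming the entry points in question are the cudart calls discussed above (the symbol list here is an assumption for illustration, not ollama's actual gpu_info code):

```c
/* Sketch only: soft-load libcudart at runtime so a missing library
 * degrades to CPU mode instead of being a hard link-time dependency.
 * The symbol list is an assumption, not ollama's real discovery code. */
#include <stdio.h>
#include <dlfcn.h>

typedef int (*cudaGetDeviceCount_t)(int *);
typedef int (*cudaMemGetInfo_t)(size_t *, size_t *);

int main(void) {
    void *lib = dlopen("libcudart.so", RTLD_LAZY);
    if (!lib) {
        fprintf(stderr, "libcudart.so not found, falling back to CPU: %s\n", dlerror());
        return 0;
    }

    cudaGetDeviceCount_t getCount = (cudaGetDeviceCount_t)dlsym(lib, "cudaGetDeviceCount");
    cudaMemGetInfo_t memInfo = (cudaMemGetInfo_t)dlsym(lib, "cudaMemGetInfo");
    if (!getCount || !memInfo) {
        fprintf(stderr, "required cudart symbols missing, falling back to CPU\n");
        dlclose(lib);
        return 0;
    }

    int count = 0;
    if (getCount(&count) == 0 /* cudaSuccess */ && count > 0) {
        size_t free_b = 0, total_b = 0;
        memInfo(&free_b, &total_b);  /* queries the current (default) device */
        printf("found %d CUDA device(s), %zu MiB free of %zu MiB\n",
               count, free_b >> 20, total_b >> 20);
    }
    dlclose(lib);
    return 0;
}
```

Built with something like `cc -o probe probe.c -ldl`; the point is that libcudart stays a runtime-optional dependency, so the same binary can fall back to CPU mode when the library or its symbols are absent.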
I'll try to whip something up as gpu_info_tegra.c & gpu_info_tegra.h with the same structure as gpu_info_cuda.c etc. I can see the benefits of keeping your code as loosely coupled to CUDA as possible. Correct me if I'm wrong, but does ollama compile llama_cpp into itself when you build it? I didn't see anything in the manual installation instructions. If so, Jetson doesn't work with the default compilations of llama_cpp and requires a build using a syntax I've copied from dustynv's llama_cpp container build. Where is the proper place to inject llama_cpp build flags?
We use
I saw something in the comments about all your CUDA builds requiring/using AVX. Tegras ship with ARM64 CPUs that don't have AVX extensions, so if the logic automatically disables GPU support if AVX isn't present (as per the comments) then the GPU library loading will be skipped every time. |
Good catch! Yes, that recently introduced logic needs to be x86-only. I'll get a PR up for that.
Okay, so I wrote a .c file for Tegra devices, edited gpu.go to accommodate it, and made a few tweaks here and there. Long story short: I recompiled llama_cpp and ollama, set a couple of env variables, and got it to run. https://github.com/remy415/ollama_tegra_fix.git @dhiltgen I don't know how you want to approach incorporating this into Ollama for Jetson users, whether you want to incorporate it into the main branch or offer it as a patch or something. @bnodnarb Try the patch at my GitHub repo and see if it works for you; it worked on my 8GB Orin Nano.
@remy415 Are you running this within Docker? Even with the instructions above I still get compile errors when issuing:
No, I didn't. Did you run `go clean` first? And did you pull from the repo linked in my pull request?
I need more info about your setup and what repo you're using.
@remy415
Regarding the comment in bold: Also ensure: IMPORTANT -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc is in the llm/generate/gen_linux.sh file under CUBLAS; That is already included under:
|
Yes, I created the guide before I made the package; I need to update the guide. If you're running Jetpack 5, it should work if you clone the repo and install it as is. I am currently working back my edits to make them align with the default install. I think if you ensure your LD_LIBRARY_PATH is set, it should work.
@Q-point Use this repo in an empty folder; it's the whole package: https://github.com/remy415/ollama.git
OK, just got it working with that repo. https://github.com/remy415/ollama.git
Okay, let me know if the GPU acceleration works 😃
Thanks for your help (#2491 (comment)). I have cmake version 3.28.2 and tried your tips from there, but when I start `go generate ./...` I still get errors like the ones here: #1979 (comment)
@telemetrieTP23 I responded to your other post 😀 I had a typo in my original response and it's updated now.
@remy415 Hey, I had a quick follow-up question to all this. Your build is working great on my Orin AGX! Here is some output from the ollama service:
Given that beggars can't be choosers, have you run across this issue? Is there perhaps a switch for CUDA=all or similar? Thanks!
@davidtheITguy According to the NVidia Jetson AGX Orin technical brief at https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf, it would seem that it's two graphics compute clusters with a unified front end. I don't have an AGX Orin so I can't personally confirm, but one way I would check is to run a device query that shows the getDeviceCount results (the ollama service logs show the results of that API call further up towards the top). Last, the lines beginning with
@remy415 Got it, ty. Weird that only one GPU works (I definitely get both with the python/HF scripts). I'll report back when able.
@remy415 A quick clarification: you are correct, it does appear that both GPUs are exposed as one interface. The difference appears to be the "GPU Shared RAM" capability not being utilized by the llama.cpp back end in this case. I'll keep digging.
@davidtheITguy I don't know which model you have loaded, but your jtop is reporting 9.1G of GPU shared RAM being used, which is definitely more than a model like Mistral 7b uses (typically ~4G RAM). I think your hardware is being fully leveraged.
I cannot reproduce your builds using go version go1.22.1 linux/arm64. Does anyone have a binary for an Orin NX 16 GB?
It should build on the Orin NX. Please clone my repo. Then
Thanks @remy415, already trying with your repo. The error on `go build` is:
@ToeiRei Yes, those are compiler warnings, but they are not critical errors. Those particular warnings are present even in the Ollama main builds. The binary should still have compiled and should work for you.
My bad. I had expected something like a "compile done" message, as it just showed the warnings and a prompt.
No worries, I fell down the same rabbit hole myself.
Thanks. After failing to restart ollama due to a lack of caffeine, we're finally cooking with
That's great! How is the performance on the Orin NX?
It's definitely slower than my RTX 4070, but it does not feel too bad. Like a person live-typing on a 13b model.
But somehow there's got to be a problem with the model, as it says
While in the ollama directory, and with
Edit the model file:
Create the model:
It would seem the model discovery logic is currently broken, as the Modelfile I had made previously also failed with the same error as yours, and I had to use the above commands to generate a new "template" with the correct path auto-populated.
I got some error messages:
Hi @UserName-wang. I guess your issue is caused by @remy415 still developing on his main branch (he forgot to push the changes for importing). I have run @remy415's great forked version of ollama on my Jetson AGX Orin with L4T 36.2.0 (Jetpack 6.0 DP) for a few days. It works great with
@UserName-wang @hangxingliu The fork is currently broken due to an incomplete merge. I'm working with @dhiltgen to get it fixed.
@UserName-wang @hangxingliu Try this branch here
@remy415 Thank you a lot! Now it works, and I'm sure ollama is now running on the GPU. I tried gemma and llama2 and they run fast! I used `go generate ./... && go build .` to build this package on the host (AGX Orin) and it works, but it failed to run in Docker. The build process is successful (and detected the GPU), but the command `ollama run gemma:2b` failed; the error message is attached. Can you please have a look and give me some suggestions if you have time? Thank you!
@UserName-wang Just to confirm:
I'm going to look at the gpu.go file and see why it didn't print the path to the library it loads as well.
@UserName-wang Just FYI, I haven't worked out running this in Docker containers, as the Jetson is a bit of an oddity with that. I suggest you look at dustynv's Jetson containers for running GPU-accelerated stuff in Docker containers, as the runtime needs a special configuration to work on Jetsons.
Answer to question 1: yes! And Docker already detected CUDA.
@UserName-wang OK, so it runs outside of Docker but does not run inside of Docker? The error message in your log suggested it is missing a driver; that's why I asked if you configured it for Docker by following NVidia's instructions for CUDA. I don't have Jetpack 6 installed yet; I was waiting for the official release.
Fixed with merge of #2279.
@UserName-wang dusty-nv merged a PR with a container for Ollama. You can find it on his GitHub page.
Yes! I tested it and it works! Thank you for your information!
I've reviewed the great tutorial made by @bnodnarb here:
https://github.com/jmorganca/ollama/blob/main/docs/tutorials/nvidia-jetson.md
The Orin Nano is running Ubuntu 20.04 with Jetpack 5.1.2 (r35.4.1 L4T). The container is also running L4T version 35.4.1. Jetpack 5.1.2 comes with CUDA 11.4 installed with compatibility support for CUDA 11.8.
I also followed along with the other 3 Jetson-related issues and have not found a fix.
I have also:
Run ollama serve
In each of the situations, I used the 'mistral-jetson' generated model. For each of them, I get a similar output:
Key outputs are:
2024/01/13 20:14:03 routes.go:953: no GPU detected
llm_load_tensors: mem required = 3917.98 MiB
Again, I would just like to note that the stable-diffusion-webui application works with the GPU, as does the referenced docker container from dustynv. Any suggestions of things to check?
Update: I forgot to mention that I verified CPU and GPU activity using jtop in another terminal. Edited for formatting. Edited to add OS & Jetson versions. Edited to add CUDA version.