Add support for libcudart.so for CUDA devices (Adds Jetson support) #2279
Conversation
Resolves #1979
@dhiltgen I don't know if you're the right contact for this, but I'm having issues getting the correct memory amounts for GetGPUInfo() on Jetsons. Since they are iGPUs, the memory is shared with the system (8 GB in my case). The free memory reported by cudaGetMem and the memory reported by Sysinfo aren't necessarily even the correct free memory, because the Jetsons use a portion of RAM as flexible cache. There is a semi-accurate way to get "available memory", but the only decent way I've seen to get that information is to run free -m or to read /proc/meminfo, as the kernel has some fancy maths it does to give a semi-accurate representation of available memory. The 'buff/cache' field and 'available' field aren't reported by sysinfo (or cudaGetMem), and even the /usr/bin/free binary does an fopen() call on /proc/meminfo. For now I'm just setting the current "free memory" to report the greater of cudaGetMem or sysinfo free memory. I read that the "available memory" field is considered the best guess for actual available memory according to the git notes for meminfo.c. However, it requires parsing /proc/meminfo or calling /usr/bin/free, which does the same thing. Do you have any ideas for the best way to report this information to the application? I tried putting in some overhead, but the Jetson kept falling back to CPU due to memory even though there was extra memory available in the cache.
Changed this to a draft while working through memory issues.
gpu/gpu.go
Outdated
var TegraLinuxGlobs = []string{
	"/usr/local/cuda/lib64/libcudart.so*",
	"/usr/local/cuda/lib*/libnvidia-ml.so*",
	"/usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so*",
}
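For context, a glob list like this can be expanded into candidate library paths with the standard library. Here findGPULibs is an illustrative stand-in for the real FindGPULibs in gpu.go, which may differ:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// findGPULibs is a minimal sketch of glob-based library discovery:
// expand each pattern and collect every match, preserving pattern
// order so earlier (preferred) locations are tried first.
func findGPULibs(patterns []string) []string {
	var found []string
	for _, pattern := range patterns {
		matches, err := filepath.Glob(pattern)
		if err != nil {
			continue // malformed pattern; skip it
		}
		found = append(found, matches...)
	}
	return found
}

func main() {
	globs := []string{
		"/usr/local/cuda/lib64/libcudart.so*",
		"/usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so*",
	}
	fmt.Println(findGPULibs(globs))
}
```

Note filepath.Glob only matches within path components (no recursive `**`), which is why each toolkit layout needs its own pattern in the list.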
With some refinement, I believe this could be generalized to support any cuda system that either doesn't have nvidia-ml or where the mgmt lib is busted. It also opens the question of if we should also try to use the libcudart.so we carry as payload: https://github.com/ollama/ollama/blob/main/llm/generate/gen_linux.sh#L148-L157
If this turns out to be reliable for gathering GPU information, we could consider dropping the nvidia-ml dependency entirely.
I think it is a viable alternative, but from the minimal research I did I believe libnvidia-ml.so is included with the CUDA driver (except for Jetpack 5 and below Tegra devices), and I think libcudart.so is a CUDA Toolkit package. Standard users likely won’t have the toolkit installed so that would become an additional dependency on the application since it’s dynamically linked, unless I’m mistaken.
Oh, I missed your reference to the libcudart payload. The only catch I can see with that is that CUDA drivers have backwards compatibility while the CUDA toolkit has forward compatibility; the bundled libcudart would need to be set to the lowest toolkit level required by your function calls, or it would restrict the application to a minimum CUDA driver version.
With some refinement, I believe this could be generalized to support any cuda system that either doesn't have nvidia-ml or where the mgmt lib is busted. It also opens the question of if we should also try to use the libcudart.so we carry as payload: https://github.com/ollama/ollama/blob/main/llm/generate/gen_linux.sh#L148-L157
If this turns out to be reliable for gathering GPU information, we could consider dropping the nvidia-ml dependency entirely.
@dhiltgen I can work on that. Could you please provide some clarification on this though:
routes.go seems to call the initializer for llm.go (llm.Init()), which calls payload_common.go and has it look up which libraries are present on the host system. Based on that, it requests the necessary llama.cpp server bundles, then unpacks the libraries included in the server bundle into the /tmp/ollama######## folder it creates, and prints which libraries were loaded.
routes.go
https://github.com/ollama/ollama/blob/main/routes.go#L987-L998
llm.go
https://github.com/ollama/ollama/blob/main/llm/llm.go#L145-L157
payload_common.go
https://github.com/ollama/ollama/blob/main/llm/payload_common.go#L105-L149
The clarification I am requesting is this: it seems like a chicken-and-egg scenario. gpu.go finds libraries on the host to determine if it can load CUDA, ROCM, Metal, or CPU, then it unbundles the relevant package and loads that library on the backend. payload_common extracts payloads but doesn't return anything of use to its calling functions.
TL;DR: using the bundled libraries is a great idea but would require re-working how gpu.go is initialized.
I'm working on a PoC for loading CUDA from packaged dirs. The same concept should work for the other drivers too, but I can't test them. Any feedback or input would be appreciated. Thanks
Agreed, we'll need to reshuffle things a bit to fully realize that approach. I think we can break this down into ~3 different steps.
- What I think makes the most sense is to first refactor this "tegra" code to be a "cuda" variant that will work on Tegra and plain old CUDA Linux/Windows systems where nvidia-ml didn't work or wasn't found. Initially, start with just the cudart lib found on the host; I think that would be a mergeable state for this PR.
- Then an incremental follow-up PR could reshuffle the library extraction flow so that we can add the cudart payload into the search path at the end, and move this cuda-based lookup before nvidia-ml.
- Finally, if things are looking good and people aren't hitting problems, we can remove the nvidia-ml lookup.
I haven't had as much time to work on this as I'd like, but I have made progress.
Running into an issue with the way free memory is reported on Tegra devices. It seems like the OS already builds a kind of "buffer" into its reported free memory, and it's fairly significant. I'm going to remove the artificial buffer for Tegra devices and see if I run into any issues.
As shown in the screenshots, there's quite a big variance in reported free memory, resulting in layers not being properly sent to the GPU as shown here:
Note: this started happening after loading a 2nd model. I initially loaded Mistral 7b and it worked just fine. When I switched to openhermes, that's when it stopped sending all the layers to the GPU. I think for Tegra devices we can probably let the L4T OS handle the extra buffer and just report 100% of the free memory reported by CUDA or the system, versus adding the 1GB or 10% buffer.
I have tested this PR on the following device: Device used to test: CUDA libraries are detected and used, generation uses 100% GPU. After installation in
I propose a change to the file On line 87, where the
I just checked my own Jetson deployment and its service, and I ran into the same issue. For some reason it has both a render and a video group, and the service didn't work until the ollama user was added to the video group. I'll add logic for it in the script in my PR as part of the Jetson compatibility work.
I'm rewriting the NVIDIA-Jetson tutorial to match the situation after your PR is applied. I'll add it as a Gist here to see if we can also add that to the PR.
I've automated most of the build process for getting this to work on Jetson; if you have your lib paths properly updated, you should be able to just pull it,
@remy415 thanks! I'll try to take a look within the next few days. (I've been a bit distracted with the imminent Windows release)
Oh I completely understand, no rush from my side. Thank you for your help and support!
@remy415: Here's a suggestion to replace the
Thank you for writing that up. I would advise on a couple of things:
Overall I like where this is heading. Thanks!
gpu/gpu.go
Outdated
// Possible locations for the nvidia-ml library
var CudaLinuxGlobs = []string{
	"/usr/local/cuda/lib*/libnvidia-ml.so*",
nit - I think this may yield 32/64 bit errors/warnings in the logs on more systems, so I'd move it to the bottom of the list.
gpu/gpu.go
Outdated
cudaLibPaths := FindGPULibs(cudaMgmtName, cudaMgmtPatterns)
if len(cudaLibPaths) > 0 {
	cuda := LoadCUDAMgmt(cudaLibPaths)
	// Prefer libcudart.so if it's available
I was thinking this might be a follow-up change to toggle the order, but maybe this is OK.
gpu/gpu.go
Outdated
		return
	}
} else {
I don't think we want this wrapped in an else. If we did find cudart libs, but they didn't load properly, we'd still want to try to load the mgmt libs next (given the current sequence in this PR). Maybe this is getting a bit pedantic (what are the odds cudart fails but the mgmt lib works?), but logically I think it makes sense to try each discovery model independently.
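A minimal sketch of the control flow being suggested, with each discovery model attempted independently rather than chained in an else. All names here are illustrative mocks, not the gpu.go API:

```go
package main

import "fmt"

// findLibs and load are mocked stand-ins for glob discovery and
// dlopen-style loading of one discovery model (cudart or nvidia-ml).
func findLibs(kind string) []string {
	if kind == "cudart" {
		return []string{"/usr/local/cuda/lib64/libcudart.so"}
	}
	return []string{"/usr/lib/x86_64-linux-gnu/libnvidia-ml.so"}
}

func load(kind string, paths []string) bool {
	// Pretend the cudart paths exist but the library fails to load.
	return kind == "nvml" && len(paths) > 0
}

func detect() string {
	// Try cudart first; note there is no else guarding the nvml step,
	// so found-but-broken cudart libs still fall through to nvml.
	if paths := findLibs("cudart"); len(paths) > 0 && load("cudart", paths) {
		return "cudart"
	}
	if paths := findLibs("nvml"); len(paths) > 0 && load("nvml", paths) {
		return "nvml"
	}
	return "cpu"
}

func main() {
	fmt.Println(detect())
}
```

With the else removed, a cudart glob hit whose load fails no longer blocks the nvml attempt.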
The reason I did it this way is that we were trying to incorporate libnvidia-ml.so and libcudart.so under the unified "cuda" handle. If I want them to be separate, I think I would need to create a different configuration similar to the first entry I did with the gpu_info_tegra.c, etc. build. I'll see what I can come up with to remove the else, but one way or another it would likely involve adding another GPU handle type.
Side note: I did see your PR regarding the creation of the common LLM libraries folder. I've already successfully loaded the packaged libraries in another branch, and it wouldn't take much to change the code to support the ~/.ollama/lib filepath. I did notice that ROCM libraries aren't packaged with llama.cpp and the only libraries that get bundled are the cuda libraries.
I did notice that ROCM libraries aren't packaged
Unfortunately the ROCm libraries are massive, so bundling them inside the executable like we do with cuda isn't viable. We're still trying to figure out the optimal approach for radeon cards which is part of the reason we still haven't officially launched support for them, even though they generally work on main. It's plausible we may shift to an approach of decoupled dependency downloads and lazy-load them if they weren't pre-loaded by the installer flow. That PR you mention is an incremental step that will help if we do go that route.
Yeah, I saw the comments you included in the gen_linux.sh file about the size of the ROCm libraries being the reason they aren't bundled in.
With loading the packaged drivers though, I had a hard time resolving the "chicken and egg" scenario presented: I needed the library loaded to check if the device is present and ready to go, and needed to know if the device is present or not to figure out which library to load. In the end, I just loaded them all and let them fail until one was successful, then set that one as the library. I'll see if I can sneak in support for the new PR to load the pre-packaged cuda libraries.
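The brute-force approach described above can be sketched like this, where tryLoad is a mocked stand-in for the dlopen/symbol-wiring step:

```go
package main

import (
	"errors"
	"fmt"
)

// tryLoad stands in for dlopen-style loading of one candidate library;
// the real code would wire up symbols from the .so and probe for a
// device. Mocked here: only one candidate "works".
func tryLoad(path string) error {
	if path == "/tmp/ollama/cuda/libcudart.so" {
		return nil
	}
	return errors.New("load failed")
}

// pickLibrary resolves the chicken-and-egg problem by brute force:
// attempt every extracted payload library in order and keep the first
// one that loads and initializes successfully.
func pickLibrary(candidates []string) (string, error) {
	for _, c := range candidates {
		if err := tryLoad(c); err == nil {
			return c, nil
		}
	}
	return "", errors.New("no usable GPU library")
}

func main() {
	libs := []string{
		"/tmp/ollama/rocm/librocm_smi64.so",
		"/tmp/ollama/cuda/libcudart.so",
	}
	fmt.Println(pickLibrary(libs))
}
```

The candidate ordering doubles as the preference order, so putting host libraries before (or after) payload libraries is a one-line policy change.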
gpu/gpu.go
Outdated
@@ -248,6 +300,9 @@ func CheckVRAM() (int64, error) {
	overhead = gpus * 1024 * 1024 * 1024
}
avail := int64(gpuInfo.FreeMemory - overhead)
if gpuInfo.Library == "cuda" && CudaTegra != "" {
	avail = int64(gpuInfo.FreeMemory)
Remind me why we don't set aside the overhead on Tegra? (this probably needs a comment breadcrumb to explain the special case)
Tegra has a caching system that eats up "available memory" but is ready to be freed on demand. When I left in the 1024 MB overhead, it would frequently give "falling back to CPU -> OOM" even though jtop was showing enough memory free. I tried to find an automated way to properly report available memory including freeable cache, and the gist is that free -m produces the most accurate (and accurate is almost the wrong word) depiction of free memory on Linux and should be what is used. There isn't a library or API to query the same data other than reading it directly from /proc (and in fact, that's what the free binary does).
TL;DR: I let the Tegra OS manage overhead as it seems to be better at it than me. Also, the memory reported by the sysinfo library, the free memory reported by the CUDA driver, and jtop were all inconsistent in the values reported.
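A hedged sketch of how the special case, with the comment breadcrumb requested above, might read. Field and function names are assumptions; the constants follow the diff:

```go
package main

import "fmt"

// availableVRAM applies the overhead buffer (1 GiB per GPU or 10% of
// free memory, whichever is larger), except on Tegra.
// Breadcrumb: Tegra iGPUs share RAM with the OS, and L4T keeps a
// reclaimable cache that deflates the reported free memory; with an
// extra 1 GiB buffer on top, 7B models were falling back to CPU even
// though jtop showed enough free. So on Tegra we trust the OS-reported
// figure as-is. (Sketch only, not the actual CheckVRAM code.)
func availableVRAM(freeMemory, gpus uint64, library, cudaTegra string) int64 {
	overhead := freeMemory / 10
	if overhead < gpus*1024*1024*1024 {
		overhead = gpus * 1024 * 1024 * 1024
	}
	avail := int64(freeMemory - overhead)
	if library == "cuda" && cudaTegra != "" {
		avail = int64(freeMemory) // no artificial buffer on Tegra
	}
	return avail
}

func main() {
	fmt.Println(availableVRAM(8<<30, 1, "cuda", "35.4.1")) // Tegra: full 8 GiB
	fmt.Println(availableVRAM(8<<30, 1, "cuda", ""))       // desktop: minus 1 GiB
}
```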
gpu/gpu_info_cuda.c
Outdated
I'd suggest a different approach on this one. Instead of modifying this native code to be a hybrid mgmt/cudart model, create 2 distinct impls. You could rename this file to something like gpu_info_nvidiaml.c and leave its logic ~unchanged. Then create a new streamlined gpu_info_cudart.c that exclusively uses the cuda runtime lib for all discovery tasks. You can get rid of the lib_t toggling logic and only define and wire up the function symbols that are applicable to each library individually.
What I'm hoping is that if the cudart discovery model works out, we have the option to entirely remove the nvidiaml code in the future, so I'd prefer we keep each approach clean and independent.
Agreed, that's how I had it at first; I had just named it gpu_info_tegra. I think I misinterpreted your original recommendation to merge the Tegra stuff under a single cuda umbrella. I'll swap it back and go with the nvml & cudart approach like you suggested above.
llm/generate/gen_common.sh
Outdated
"6")
	echo "Jetpack 6 detected. Setting CMAKE_CUDA_ARCHITECTURES='87'"
	CMAKE_CUDA_ARCHITECTURES="87"
	;;
"5")
	echo "Jetpack 5 detected. Setting CMAKE_CUDA_ARCHITECTURES='72;87'"
	CMAKE_CUDA_ARCHITECTURES="72;87"
	;;
"4")
	echo "Jetpack 4 detected. Setting CMAKE_CUDA_ARCHITECTURES='53;62;72'"
	CMAKE_CUDA_ARCHITECTURES="53;62;72"
	;;
Is it not possible to build a single ARM binary that has architectures 53-87 so the same binary will work on the different jetpacks?
I tried to compile it on my Jetson leaving it the default architectures and it failed to compile. I checked into dustynv's container for llama_cpp and he explicitly sets the architectures to these based on the version of L4T running. Once I did that, it compiled without issue. If you have an ARM build system that will incorporate the other changes in this PR but build for all the architectures, I can test if it works.
I hadn't considered that this would create device-specific binaries (versus things that just load at runtime), I'll have to explore other options.
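For anyone building from source on a Jetson in the meantime, the workaround discussed here amounts to pinning the architectures before generating. A sketch (the JetPack 5 values follow dustynv's table; the commented-out commands are the usual source-build steps and are assumptions about your local setup):

```shell
#!/bin/sh
# Illustrative source build on a JetPack 5 Jetson: pin the CUDA
# architectures to the ones the device actually supports, since the
# desktop defaults fail to compile on-device.
CMAKE_CUDA_ARCHITECTURES="72;87"
export CMAKE_CUDA_ARCHITECTURES
echo "building with CMAKE_CUDA_ARCHITECTURES=${CMAKE_CUDA_ARCHITECTURES}"
# go generate ./...   # actual build steps, commented out here
# go build .
```

JetPack 6 would use "87" and JetPack 4 "53;62;72", per the case table above.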
llm/generate/gen_common.sh
Outdated
if [ -n "${TEGRA_DEVICE}" ]; then
	echo "CMAKE_CUDA_ARCHITECTURES unset, values are needed for Tegra devices. Using default values defined at https://github.com/dusty-nv/jetson-containers/blob/master/jetson_containers/l4t_version.py"
fi
# Tegra devices will fail generate unless architectures are set:
To clarify this comment: when you go generate ./... on a Jetson, it breaks if you try to target the wrong architectures?
For our official builds on ARM, we leverage Docker Desktop running on ARM Macs, so my hope was we'd be able to generate an ARM binary that was inclusive of all the architectures and would run on the different JetPack systems.
Yeah this is what happens. It seems to be due to partial architecture support included in the Jetsons.
That's unfortunate they don't include enough headers/defs to be able to target other families from a given device. It sounds like there's still hope our official build could work, but users building from source on device will need to narrow the scope.
llm/llm.go
Outdated
@@ -79,6 +79,24 @@ func New(workDir, model string, adapters, projectors []string, opts api.Options)
		break
	}

	if info.Library == "tegra" {
I'm not seeing where this is set. Maybe I'm missing it, but is this stale code perhaps?
I forgot that this was still in here. Originally it just automatically set NumGPU to 999 for Tegra devices if not manually set in an env variable. I'll remove it.
@dhiltgen My apologies for the giant commit spam on this; I'm trying to keep my branch updated with ollama main while integrating the libcudart changes. I think this commit may fulfill the objective of adding libcudart support. Jetson users will possibly need to include environment variables on build, but given the nature of Jetson devices as development boards, I believe they should be equipped to do so anyway. I also included logic in gen_linux.sh to disable AVX extensions in the CUDA build if the architecture is arm64, as those chips generally don't support them.
@remy415 let me know when you think this is in pretty good shape and I'll take another review pass through.
@dhiltgen I think it's in a pretty good place for step 1 of getting the libcudart support integrated. The only thing is it needs to be tested on machines with multiple GPUs to see if the code for the meminfo lookup works correctly on those systems (and I'd need to swap the priority for nvml/cudart, as I put it back to loading nvml first).
I pulled the newest version of this PR, ran
Not sure what the problem is though.
@jhkuperus Which Jetson device do you have, and which JetPack version are you running? I haven't tested it with JP6 yet. If you're not running JP6, you should clean your installation, as JP5 and below are not compatible with CUDA 12. Ensure you aren't installing any new video drivers or any new CUDA toolkit software, other than the 11.8 compatibility package as referenced on the Tegra support pages (it should work without this though). In the meantime I'll run some tests and see if I messed up this last merge.
@jhkuperus I executed a fresh build on my Jetson Orin Nano running JP5 and didn't run into the same segfault issue you had. I would guess your issue is either something present in JP6 that I haven't tested yet, or something like the JP5 + CUDA12 issue I referenced earlier.
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 DP [L4T 36.2.0]
I tested an earlier build from this PR a week or two ago. That one works just fine. Tell me if running it with some sort of verbose flags or debugging will help you find out what the problem is. I'll gladly help.
@jhkuperus okay so you are running JP6. It's really odd that an earlier build works when this doesn't, as the only substantial changes I've made are to sync it with upstream. When you say you pulled the newest version, did you clone it to a fresh folder or did you git pull on top of the existing folder? Try deleting the entire ollama folder and cloning it again. Additionally, if you're running it manually with
Alternatively, JP6 is supposed to include libnvidia-ml.so support according to dustynv. I haven't looked through it myself, but if that's true then the binary straight from Ollama may work.
I removed the entire repo, checked it out anew, and ran the build again. It's working, and I also know where my snag was. I ran
@dhiltgen I think I may have figured out the issue with having to set specific architectures on Jetson devices. It may be related to an upstream llama.cpp issue fixed here, and somewhat explained here:
Seems like the issue is related to the fp16 typedef in llama.h for ARM64 platforms. Confirmed (somewhat) when I unset CMAKE_CUDA_ARCHITECTURES (thus compiling the default), and set
Update: I've done more digging into the F16 issue. I'm not sure why my particular compiler is having this issue, but it would seem the crux of the problem is somewhat alluded to in an NVIDIA CUDA 8 Mixed Precision guide:
And on another GitHub issues thread, they also reference that
The NVIDIA CUDA Programming Guide also has some references:
Additionally, when I tried to compile with
This error goes away if I change
@dhiltgen do you know why
It also seems like llama.cpp upstream changed the way they included
Just to clarify: if CUDA Architecture is "50;51", setting
This was needed to add support for older GPUs, and based on the testing we did at the time, it didn't seem to have a major performance impact on newer GPUs. For Jetson support, Compute Capability 5.0 isn't relevant as far as I know, so this flag can be omitted.
Okay so I guess the takeaway is for ARM-based CUDA builds, leave architectures at default and disable f16 support, and it should be golden. |
@dhiltgen I merged the PR with the latest Ollama release, which removed most of the AMD code from 'gpu.go'. I tested the build on my ARM and WSL+CUDA setups, and it looks like it's still good to go. I also adjusted the memory overhead section to better align with the original code by setting overhead to 0 if the L4T env variable is detected. I know it would be preferred to leave in an overhead buffer, but the L4T OS's automatic caching messes with the reported free memory (it is reported lower than it actually is), so sometimes when loading 7B models the free memory reported is less than the model loader needs and it falls back to CPU. Maybe we can add a flag somewhere to let the user manually disable the overhead buffer assignment and leave it on by default? Not sure how to handle this. Anyway, other than that the PR is ready for review with the latest release merged.
Great job!
Looking good!
I made a few minor changes to get Windows working and favor the cudart lib we carry as payload: remy415#1
Also needs a rebase.
gpu/gpu_info_cudart.c
Outdated
if (h.verbose) {
  cudartBrandType_t brand = 0;
  // When in verbose mode, report more information about
  // the card we discover, but don't fail on error
  // Need to map out alternatives of these for CUDART libraries
  // For now just returning a generic "unsupported" error.
  ret = CUDART_UNSUPPORTED;
  if (ret != CUDART_SUCCESS) {
    LOG(h.verbose, "nvmlDeviceGetName unsupported with CUDART libraries: %d\n", ret);
  } else {
    LOG(h.verbose, "[%d] CUDA device name: %s\n", i, buf);
  }
  if (ret != CUDART_SUCCESS) {
    LOG(h.verbose, "nvmlDeviceGetBoardPartNumber unsupported with CUDART libraries: %d\n", ret);
  } else {
    LOG(h.verbose, "[%d] CUDA part number: %s\n", i, buf);
  }
  if (ret != CUDART_SUCCESS) {
    LOG(h.verbose, "nvmlDeviceGetSerial unsupported with CUDART libraries: %d\n", ret);
  } else {
    LOG(h.verbose, "[%d] CUDA S/N: %s\n", i, buf);
  }
  if (ret != CUDART_SUCCESS) {
    LOG(h.verbose, "nvmlDeviceGetVbiosVersion unsupported with CUDART libraries: %d\n", ret);
  } else {
    LOG(h.verbose, "[%d] CUDA vbios version: %s\n", i, buf);
  }
  if (ret != CUDART_SUCCESS) {
    LOG(h.verbose, "nvmlDeviceGetBrand unsupported with CUDART libraries: %d\n", ret);
  } else {
    LOG(h.verbose, "[%d] CUDA brand: %d\n", i, brand);
  }
}
Go ahead and remove this.
gpu/gpu_info_cudart.c
Outdated
    resp->err = strdup(buf);
    return;
  }
  ret = (*h.cudaMemGetInfo)(&resp->free, &resp->total);
You should pass memInfo.free and memInfo.total here so we can aggregate multi-GPU setups on lines 164,165.
gpu/gpu_info_cudart.h
Outdated
typedef enum cudartBrandType_enum | ||
{ | ||
CUDART_BRAND_UNKNOWN = 0, | ||
} cudartBrandType_t; |
It looks like this is unused and can be removed.
Sorry about that. I should have left some comments to aid in the rebase. Use
We're reshuffling things more with #3218, so depending on when we get this vs. that PR merged, there may be some more rebase adjustments necessary.
@dhiltgen I tested this on my Jetson and it compiled. Running a WSL test now. Should be good to go from here.
@dhiltgen Thank you for your support!
This is forcing everyone else to use
@buroa The intent is to shift NVidia users to using the embedded
Unless you mean non-CUDA users are forced to use
@buroa cudart is bundled into the released binaries, and we should gracefully handle failures to dynamically load this library. Systems without NVidia GPUs shouldn't see a hard dependency on cudart. If you think we missed some corner case, please let us know more details.
Can Ollama support GPU operation on Jetson now? I followed the tutorial to run it on the Jetson Orin Nano, but it still shows 0% GPU utilization. I'm using JetPack 5.1.1.
@Czhazha yes, but you need to compile it yourself for now, or after running
Working on fixing this.
Thank you. After following your process, I encountered a new error message when running it again: CUDA error: CUBLAS_STATUS_ALLOC_FAILED. ...
Added libcudart.so support to gpu.go for CUDA devices that are missing libnvidia-ml.so. The CUDA code is split into nvml (libnvidia-ml.so) and cudart (libcudart.so) variants and can work with either. Tested on a Jetson device and on Windows 11 in WSL2.
Devices used to test:
Jetson Orin Nano 8GB
Jetpack 5.1.2, L4T 35.4.1
CUDA 11-8
CUDA Capability Supported 8.7
Go version 1.26.1
Cmake 3.28.1
nvcc 11.8.89
AMD Ryzen 3950x
NVidia RTX 3090ti
WSL2 running Ubuntu 22.04
WSL CUDA Toolkit v12.3 installed
Edited for updates