
Add support for libcudart.so for CUDA devices (Adds Jetson support) #2279

Merged
1 commit merged on Mar 25, 2024

Conversation

remy415
Contributor

@remy415 remy415 commented Jan 30, 2024

Added libcudart.so support to gpu.go for CUDA devices that are missing libnvidia-ml.so. CUDA library handling is split into nvml (libnvidia-ml.so) and cudart (libcudart.so), and either can be used. Tested on a Jetson device and on Windows 11 in WSL2.

Devices used to test:
Jetson Orin Nano 8GB
Jetpack 5.1.2, L4T 35.4.1
CUDA 11-8
CUDA Capability Supported 8.7
Go version 1.26.1
Cmake 3.28.1
nvcc 11.8.89

AMD Ryzen 3950x
NVIDIA RTX 3090 Ti
WSL2 running Ubuntu 22.04
WSL CUDA Toolkit v12.3 installed

Edited for updates

@remy415
Contributor Author

remy415 commented Jan 30, 2024

Resolves #1979

@remy415
Contributor Author

remy415 commented Feb 4, 2024

@dhiltgen I don't know if you're the right contact for this, but I'm having issues getting the correct memory amounts for GetGPUInfo() on Jetsons. Since they are iGPU devices, the memory is shared with the system (8GB in my case). The free memory reported by cudaGetMem and the memory reported by Sysinfo aren't necessarily even the correct free memory, as the Jetsons use a portion of RAM as a flexible cache. There is a semi-accurate way to get "available memory", but the only decent way I've seen to get that information is to run free -m or to read /proc/meminfo, as the kernel does some fancy maths to give a semi-accurate representation of available memory.

[screenshot: free -m output on the Jetson showing the total/used/free/buff/cache/available columns]

The 'buff/cache' field and 'available' field aren't reported by sysinfo (or cudaGetMem), and even the "/usr/bin/free" binary does an fopen() call on /proc/meminfo. For now I'm just setting it to report the greater of cudaGetMem or sysinfo free memory as the current "free memory".

I read that the "available memory" field is considered the best guess for actual available memory according to git notes for meminfo.c: meminfo.c commit . However it requires parsing /proc/meminfo or calling '/usr/bin/free' which does the same thing.

Do you have any ideas for the best way to report this information to the application? I tried putting in some overhead but the Jetson kept falling back to CPU due to memory even though there was extra memory available in the cache.
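
For reference, here is a minimal standalone sketch (not code from this PR) of the /proc/meminfo parsing described above; "MemAvailable" is the kernel's estimate that free(1) reports in its "available" column:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    // memAvailableBytes returns the kernel's "MemAvailable" estimate from
    // /proc/meminfo, which accounts for reclaimable cache the same way free(1) does.
    func memAvailableBytes() (uint64, error) {
        f, err := os.Open("/proc/meminfo")
        if err != nil {
            return 0, err
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            // Lines look like: "MemAvailable:    5123456 kB"
            fields := strings.Fields(scanner.Text())
            if len(fields) >= 2 && fields[0] == "MemAvailable:" {
                kb, err := strconv.ParseUint(fields[1], 10, 64)
                if err != nil {
                    return 0, err
                }
                return kb * 1024, nil
            }
        }
        return 0, fmt.Errorf("MemAvailable not found in /proc/meminfo")
    }

    func main() {
        avail, err := memAvailableBytes()
        if err != nil {
            fmt.Fprintln(os.Stderr, "error:", err)
            return
        }
        fmt.Printf("available memory: %d MiB\n", avail/(1024*1024))
    }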

@remy415 remy415 marked this pull request as draft February 4, 2024 06:27
@remy415
Contributor Author

remy415 commented Feb 4, 2024

Changed this to a draft while working on memory issues.

gpu/gpu.go Outdated
Comment on lines 58 to 70
var TegraLinuxGlobs = []string{
"/usr/local/cuda/lib64/libcudart.so*",
"/usr/local/cuda/lib*/libnvidia-ml.so*",
"/usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so*",
}
Collaborator

@dhiltgen dhiltgen Feb 5, 2024

With some refinement, I believe this could be generalized to support any cuda system that either doesn't have nvidia-ml or where the mgmt lib is busted. It also opens the question of if we should also try to use the libcudart.so we carry as payload: https://github.com/ollama/ollama/blob/main/llm/generate/gen_linux.sh#L148-L157

If this turns out to be reliable for gathering GPU information, we could consider dropping the nvidia-ml dependency entirely.

Contributor Author

I think it is a viable alternative, but from the minimal research I did I believe libnvidia-ml.so is included with the CUDA driver (except for Jetpack 5 and below Tegra devices), and I think libcudart.so is a CUDA Toolkit package. Standard users likely won’t have the toolkit installed so that would become an additional dependency on the application since it’s dynamically linked, unless I’m mistaken.

Contributor Author

Oh, I missed your reference to the libcudart payload. The only catch I can see with that is that the CUDA driver has backwards compatibility while the CUDA toolkit has forward compatibility; the bundled libcudart would need to be set to the lowest toolkit level required by our function calls, or it would restrict the application to a minimum CUDA driver version.

Contributor Author

With some refinement, I believe this could be generalized to support any cuda system that either doesn't have nvidia-ml or where the mgmt lib is busted. It also opens the question of if we should also try to use the libcudart.so we carry as payload: https://github.com/ollama/ollama/blob/main/llm/generate/gen_linux.sh#L148-L157

If this turns out to be reliable for gathering GPU information, we could consider dropping the nvidia-ml dependency entirely.

@dhiltgen I can work on that. Could you please provide some clarification on this though:

routes.go seems to call the initializer for llm.go (llm.Init()), which calls payload_common.go and has it look up which libraries are present on the host system. Based on that, it requests the necessary llama.cpp server bundles, then unpacks the libraries included in the server bundle into the /tmp/ollama######## folder it creates, and prints which libraries were loaded.

routes.go
https://github.com/ollama/ollama/blob/main/routes.go#L987-L998

llm.go
https://github.com/ollama/ollama/blob/main/llm/llm.go#L145-L157

payload_common.go
https://github.com/ollama/ollama/blob/main/llm/payload_common.go#L105-L149

The clarification I am requesting is this: it seems like a chicken-and-egg scenario. gpu.go finds libraries on the host to determine whether it can load CUDA, ROCm, Metal, or CPU, then it unbundles the relevant package and loads that library on the backend. payload_common extracts payloads but doesn't return anything of use to its calling functions.

TL;DR: using the bundled libraries is a great idea but would require re-working how gpu.go is initialized.

Contributor Author

I'm working on a proof of concept for loading CUDA from the packaged directories. The same concept should work for the other drivers too, but I can't test them. Any feedback or input would be appreciated. Thanks

Collaborator

Agreed, we'll need to reshuffle things a bit to fully realize that approach. I think we can break this down into ~3 different steps.

  1. What I think makes the most sense is to first refactor this "tegra" code to be a "cuda" variant that will work on tegra and plain old cuda linux/windows systems where nvidia-ml didn't work or wasn't found. Initially, start with just the cudart lib found on the host, and I think that would be a mergeable state for this PR.
  2. Then an incremental follow-up PR could reshuffle the library extraction flow so that we can add the cudart payload into the search path at the end, and move this cuda-based lookup before nvidia-ml.
  3. Finally, if things are looking good and people aren't hitting problems, we can remove the nvidia-ml lookup.

Contributor Author

I haven't had as much time to work on this as I'd like, but I have made progress.

Running into an issue with the way free memory is reported on Tegra devices. It seems like the OS already builds a kind of "buffer" into its reported free memory, and it's fairly significant. I'm going to remove the artificial buffer for Tegra devices and see if I run into any issues.

[screenshots: memory readings showing the variance in reported free memory described below]

As shown in the screenshots, there is quite a big variance in reported free memory, resulting in layers not being properly sent to the GPU, as shown here:

[screenshot: server log showing layers not being offloaded to the GPU]

Note: this started happening after loading a 2nd model. I initially loaded Mistral 7b and it worked just fine. When I switched to openhermes, that's when it stopped sending all the layers to GPU. I think for Tegra devices, we can probably let the L4T OS handle the extra buffer and just report 100% of the free memory reported by CUDA or the system versus adding the 1GB or 10% buffer.

@remy415
Contributor Author

remy415 commented Feb 12, 2024

@dhiltgen I think this version meets the criteria for step #1, what do you think?

@jhkuperus

I have tested this PR on the following device:

Device used to test:
Jetson AGX Orin Developer Kit 64GB
Jetpack 6.0DP, L4T 36.2.0
CUDA 12.2.140
CUDA Capability Supported 8.7
Go version 1.21.6
Cmake 3.22.1
nvcc 12.2.140

CUDA libraries are detected and used, and generation uses 100% GPU. After installation to /usr/local/bin/ollama there were permission issues when starting it as a service under the ollama user. I don't think that has anything to do with the code on this branch though. Still looking into it in issue #1979.

@remy415 remy415 marked this pull request as ready for review February 15, 2024 17:54
@jhkuperus

jhkuperus commented Feb 15, 2024

I propose a change to the file scripts/install.sh to make sure the ollama user is also added to the video group. On my Jetson, the system service needed this to be able to use the CUDA cores.

On line 87, where the ollama user is added to the render group, I propose we add these lines:

    if getent group video >/dev/null 2>&1; then
        status "Adding ollama user to video group..."
        $SUDO usermod -a -G video ollama
    fi

@remy415
Contributor Author

remy415 commented Feb 15, 2024

I propose a change to the file scripts/install.sh to make sure the ollama user is also added to the video group. On my Jetson, the system service needed this to be able to use the CUDA cores.

On line 87, where the ollama user is added to the render group, I propose we add these lines:

    if getent group video >/dev/null 2>&1; then
        status "Adding ollama user to video group..."
        $SUDO usermod -a -G video ollama
    fi

I just checked my own Jetson deployment and its service, and I ran into the same issue. For some reason, it has both a render and a video group, and the service didn't work until the ollama user was added to the video group. I'll add logic for it in the script in my PR as part of the Jetson compatibility.

@jhkuperus

I'm rewriting the NVIDIA-Jetson tutorial to match the situation after your PR is applied. I'll add it as a Gist here to see if we can also add that to the PR.

@remy415
Contributor Author

remy415 commented Feb 15, 2024

I'm rewriting the NVIDIA-Jetson tutorial to match the situation after your PR is applied. I'll add it as a Gist here to see if we can also add that to the PR.

I've automated most of the build process for getting this to work on Jetson, to the point that if your lib paths are properly updated you should be able to just pull it, go generate ./... && go build ., and be ready to go. I'm waiting for feedback from @dhiltgen before inquiring whether their backend build process supports the Jetson-specific changes or if there needs to be a Jetson-specific binary on top of MacOS/Windows/WSL/Linux_x64/Linux_aarch64. The main driver for this is that the standard CUDA build adds AVX/AVX2 support, but AVX/AVX2 breaks ARM compatibility. At the same time, do we really want the additional overhead of shipping both a "CUDA with AVX/AVX2" and a "CUDA without AVX/AVX2" build by default?

@dhiltgen
Collaborator

@remy415 thanks! I'll try to take a look within the next few days. (I've been a bit distracted with the imminent Windows release)

@remy415
Contributor Author

remy415 commented Feb 15, 2024

@remy415 thanks! I'll try to take a look within the next few days. (I've been a bit distracted with the imminent Windows release)

Oh I completely understand, no rush from my side. Thank you for your help and support!

@jhkuperus

@remy415 : Here's a suggestion to replace the docs/tutorials/nvidia-jetson.md file: https://github.com/jhkuperus/ollama/blob/edefca7ef3b1b13a8a60744b4511c48dd6e1b396/docs/tutorials/nvidia-jetson.md

@remy415
Contributor Author

remy415 commented Feb 15, 2024

@remy415 : Here's a suggestion to replace the docs/tutorials/nvidia-jetson.md file: https://github.com/jhkuperus/ollama/blob/edefca7ef3b1b13a8a60744b4511c48dd6e1b396/docs/tutorials/nvidia-jetson.md

Thank you for writing that up. I would note a couple of things:

  1. this PR is the first of 3 steps to begin loading the prepackaged shared libraries instead of querying the host. Once that is accomplished, the tutorial will be outdated.
  2. on Jetson devices, the CUDA toolkit is preinstalled. Also, the method for updating it requires adding the Jetson-specific NVIDIA repos. This will likely change again once JP6 is officially released.

Collaborator

@dhiltgen dhiltgen left a comment

Overall I like where this is heading. Thanks!

gpu/gpu.go Outdated
// Possible locations for the nvidia-ml library
var CudaLinuxGlobs = []string{

"/usr/local/cuda/lib*/libnvidia-ml.so*",
Collaborator

nit - I think this may yield 32/64 bit errors/warnings in the logs on more systems, so I'd move it to the bottom of the list.

gpu/gpu.go Outdated
cudaLibPaths := FindGPULibs(cudaMgmtName, cudaMgmtPatterns)
if len(cudaLibPaths) > 0 {
cuda := LoadCUDAMgmt(cudaLibPaths)
// Prefer libcudart.so if it's available
Collaborator

I was thinking this might be a follow-up change to toggle the order, but maybe this is OK.

gpu/gpu.go Outdated
return
}
} else {
Collaborator

I don't think we want this wrapped in an else. If we did find cudart libs, but they didn't load properly, we'd still want to try to load mgmt libs next (given the current sequence in this PR). Maybe this is getting to be a bit pedantic (what are the odds cudart fails but the mgmt lib works?) but logically I think it makes sense to try each discovery model independently.
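
For illustration only, a self-contained sketch of the "try each discovery model independently" flow described above, using made-up names rather than the actual gpu.go functions:

    package main

    import (
        "errors"
        "fmt"
    )

    // discoverFn is one independent discovery strategy (e.g. cudart, then nvidia-ml).
    type discoverFn struct {
        name string
        load func() error
    }

    // discover tries each strategy on its own instead of nesting the fallback in an
    // else branch, so a cudart library that is found but fails to load does not
    // prevent the nvidia-ml lookup from being attempted.
    func discover(strategies []discoverFn) (string, error) {
        var errs []error
        for _, s := range strategies {
            if err := s.load(); err != nil {
                errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
                continue // fall through to the next discovery model
            }
            return s.name, nil
        }
        return "", errors.Join(errs...)
    }

    func main() {
        lib, err := discover([]discoverFn{
            {"libcudart.so", func() error { return errors.New("found but failed to load") }},
            {"libnvidia-ml.so", func() error { return nil }},
        })
        fmt.Println(lib, err)
    }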

Contributor Author

The reason I did it this way is that we were trying to incorporate libnvidia-ml.so and libcudart.so under the unified "cuda" handle. If I want them to be separate, I think I would need to create a different configuration similar to my first attempt with the gpu_info_tegra.c build. I'll see what I can come up with to remove the else, but one way or another it would likely involve adding another GPU handle type.

Side note: I did see your PR regarding the creation of the common LLM libraries folder. I've already successfully loaded the packaged libraries in another branch, and it wouldn't be much to change the code to support the ~/.ollama/lib filepath. I did notice that ROCM libraries aren't packaged with llama.cpp and the only libraries that get bundled are the cuda libraries.

Collaborator

I did notice that ROCM libraries aren't packaged

Unfortunately the ROCm libraries are massive, so bundling them inside the executable like we do with cuda isn't viable. We're still trying to figure out the optimal approach for radeon cards which is part of the reason we still haven't officially launched support for them, even though they generally work on main. It's plausible we may shift to an approach of decoupled dependency downloads and lazy-load them if they weren't pre-loaded by the installer flow. That PR you mention is an incremental step that will help if we do go that route.

Contributor Author

Yeah, I saw the comments you included in the gen_linux.sh file about the size of the ROCm libraries being the reason they aren't bundled in.

With loading the packaged drivers, though, I had a hard time resolving the "chicken and egg" scenario presented: I needed the library loaded to check if the device is present and ready to go, and I needed to know if the device is present to figure out which library to load. In the end, I just loaded them all, let them fail until one was successful, and then set that one as the library. I'll see if I can sneak in support for the new PR to load the pre-packaged CUDA libraries.

gpu/gpu.go Outdated
@@ -248,6 +300,9 @@ func CheckVRAM() (int64, error) {
overhead = gpus * 1024 * 1024 * 1024
}
avail := int64(gpuInfo.FreeMemory - overhead)
if (gpuInfo.Library == "cuda" && CudaTegra != "") {
avail = int64(gpuInfo.FreeMemory)
Collaborator

Remind me why we don't set aside the overhead on Tegra? (this probably needs a comment breadcrumb to explain the special case)

Contributor Author

Tegra has a caching system that eats up "available memory" but is ready to be freed on demand. When I left in the 1024MB overhead, it would frequently give "falling back to CPU -> OOM" even though jtop was showing enough memory free. I tried to find an automated way to properly report available memory including freeable cache, and the gist is that "free -m" produces the most accurate (and accurate is almost the wrong word) depiction of free memory in Linux and should be what is used. There isn't a library or API to query the same data other than reading it directly from /proc (and in fact, that's what the free binary does).

TL;DR I let the Tegra OS manage overhead as it seems to be better at it than I am. Also, the free memory reported by the sysinfo library, the CUDA driver, and jtop were all inconsistent.
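
A simplified standalone sketch of that special case, with the breadcrumb comment requested above; the names are stand-ins for illustration, not the exact gpu.go implementation:

    package main

    import "fmt"

    // gpuInfo is a simplified stand-in for the struct used in gpu.go.
    type gpuInfo struct {
        Library    string
        FreeMemory uint64 // bytes
    }

    // availableVRAM subtracts a fixed overhead on discrete GPUs but skips it on
    // Tegra: L4T already keeps a reclaimable cache, so reserving an extra buffer
    // causes a fallback to CPU even when the model would actually fit.
    func availableVRAM(info gpuInfo, isTegra bool, overhead uint64) int64 {
        if info.Library == "cuda" && isTegra {
            return int64(info.FreeMemory) // let the L4T OS manage the headroom
        }
        return int64(info.FreeMemory - overhead)
    }

    func main() {
        jetson := gpuInfo{Library: "cuda", FreeMemory: 6 << 30}
        fmt.Println(availableVRAM(jetson, true, 1<<30))  // all 6 GiB reported
        fmt.Println(availableVRAM(jetson, false, 1<<30)) // 5 GiB after 1 GiB overhead
    }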

Collaborator

I'd suggest a different approach on this one. Instead of modifying this native code to be a hybrid mgmt/cudart model, instead create 2 distinct impls. You could rename this file to something like gpu_info_nvidiaml.c and leave its logic ~unchanged. Then create a new streamlined gpu_info_cudart.c that exclusively uses the cuda runtime lib for all discovery tasks. You can get rid of the lib_t toggling logic and only define and wire up the function symbols that are applicable to each library individually.

What I'm hoping is if the cudart discovery model works out, we have the option to entirely remove the nvidiaml code in the future, so I'd prefer if we can keep each approach clean and independent.

Contributor Author

Agreed, that's how I had it at first; I had just named it gpu_info_tegra. I think I misinterpreted your original recommendation to merge the tegra stuff under a single cuda umbrella. I'll swap it back and go with the nvml & cudart approach like you suggested above.

Comment on lines 59 to 70
"6")
echo "Jetpack 6 detected. Setting CMAKE_CUDA_ARCHITECTURES='87'"
CMAKE_CUDA_ARCHITECTURES="87"
;;
"5")
echo "Jetpack 5 detected. Setting CMAKE_CUDA_ARCHITECTURES='72;87'"
CMAKE_CUDA_ARCHITECTURES="72;87"
;;
"4")
echo "Jetpack 4 detected. Setting CMAKE_CUDA_ARCHITECTURES='53;62;72'"
CMAKE_CUDA_ARCHITECTURES="53;62;72"
;;
Collaborator

Is it not possible to build a single ARM binary that has architectures 53-87 so the same binary will work on the different jetpacks?

Contributor Author

I tried to compile it on my Jetson leaving the default architectures and it failed to compile. I checked dustynv's container for llama_cpp and he explicitly sets the architectures to these based on the version of L4T running. Once I did that, it compiled without issue. If you have an ARM build system that will incorporate the other changes in this PR but build for all the architectures, I can test whether it works.

I hadn't considered that this would create device-specific binaries (versus things that just load at runtime), I'll have to explore other options.

if [ -n ${TEGRA_DEVICE} ]; then
echo "CMAKE_CUDA_ARCHITECTURES unset, values are needed for Tegra devices. Using default values defined at https://github.com/dusty-nv/jetson-containers/blob/master/jetson_containers/l4t_version.py"
fi
# Tegra devices will fail generate unless architectures are set:
Collaborator

To clarify this comment, when you go generate ./... on a Jetson, it breaks if you try to target the wrong architectures?

For our official builds on ARM, we leverage Docker Desktop running on ARM Macs, so my hope was we'd be able to generate an ARM binary that was inclusive of all the architectures and would run on the different JetPack systems.

Contributor Author

Yeah this is what happens. It seems to be due to partial architecture support included in the Jetsons.

default_arch_build_jetson.log

Collaborator

@dhiltgen dhiltgen Feb 16, 2024

That's unfortunate they don't include enough headers/defs to be able to target other families from a given device. It sounds like there's still hope our official build could work, but users building from source on device will need to narrow the scope.

llm/llm.go Outdated
@@ -79,6 +79,24 @@ func New(workDir, model string, adapters, projectors []string, opts api.Options)
break
}

if info.Library == "tegra" {
Collaborator

I'm not seeing where this is set. Maybe I'm missing it, but is this stale code perhaps?

Contributor Author

I forgot that this is still in here. Originally it just automatically set NumGPU to 999 for tegra devices if it wasn't manually set via an env variable. I'll remove it.

@remy415
Contributor Author

remy415 commented Feb 20, 2024

@dhiltgen My apologies for the giant commit spam on this; I'm trying to keep my branch updated with ollama main while integrating the libcudart changes.

I think this commit may fulfill the objective of adding libcudart support. Jetson users will possibly need to include environment variables on build, but given the nature of Jetson devices as development boards, I believe they should be equipped to do so anyway. I also included logic in gen_linux.sh to disable AVX extensions in the CUDA build if the architecture is arm64, as those chips generally don't support them.

@remy415 remy415 changed the title Add Jetson support Add support for libcudart.so for CUDA devices (Adds Jetson support) Feb 20, 2024
@dhiltgen
Collaborator

@remy415 let me know when you think this is in pretty good shape and I'll take another review pass through.

@remy415
Contributor Author

remy415 commented Feb 27, 2024

@dhiltgen I think it's in a pretty good place for step 1 of getting the libcudart support integrated. The only thing is it needs to be tested on machines with multiple GPUs to see if the code for the meminfo lookup works correctly on those systems (and I'd need to swap the priority for nvml/cudart as I put it back to loading nvml first).

@jhkuperus

I pulled the newest version of this PR, ran go clean, go generate ./.. and go build again, but the new binary crashes when loading any model. Here's the first bit of logging where it segfaults:

Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.254+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=gpu.go:200 msg="[libcudart.so] CUDART CUDA Compute Capability detected: 8.7"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=gpu.go:200 msg="[libcudart.so] CUDART CUDA Compute Capability detected: 8.7"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.311+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2209804735/cuda_v12/libext_server.so"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.311+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: SIGSEGV: segmentation violation
Mar 06 09:32:13 yoinkee-1 ollama[205621]: PC=0xfffefc285928 m=16 sigcode=1
Mar 06 09:32:13 yoinkee-1 ollama[205621]: signal arrived during cgo execution
Mar 06 09:32:13 yoinkee-1 ollama[205621]: goroutine 38 [syscall]:
Mar 06 09:32:13 yoinkee-1 ollama[205621]: runtime.cgocall(0x944740, 0x40004ce698)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /usr/local/go/src/runtime/cgocall.go:157 +0x44 fp=0x40004ce660 sp=0x40004ce620 pc=0x407e94
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm._Cfunc_dyn_llama_server_init({0xfffed4003f50, 0xfffefc285550, 0xfffefc277b40, 0xfffefc279100, 0xfffefc294204, 0xfffefc282ba4, 0xfffefc2>
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         _cgo_gotypes.go:286 +0x30 fp=0x40004ce690 sp=0x40004ce660 pc=0x775810
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.newDynExtServer.func7(0xa77804?, 0xc?)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/dyn_ext_server.go:153 +0xe0 fp=0x40004ce780 sp=0x40004ce690 pc=0x776a60
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.newDynExtServer({0x40004a8db0, 0x2f}, {0x400059f3b0, 0x65}, {0x0, 0x0, _}, {_, _, _}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/dyn_ext_server.go:153 +0x904 fp=0x40004cea20 sp=0x40004ce780 pc=0x776774
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.newLlmServer({{0xf5796b000, 0xcdbc03000, 0x1}, {_, _}, {_, _}}, {_, _}, {_, ...}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/llm.go:158 +0x308 fp=0x40004cebe0 sp=0x40004cea20 pc=0x7737e8
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.New({0x4000304900, 0x15}, {0x400059f3b0, 0x65}, {0x0, 0x0, _}, {_, _, _}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/llm.go:123 +0x4fc fp=0x40004cee60 sp=0x40004cebe0 pc=0x77332c
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/server.load(0x4000179680?, 0x4000179680, {{0x0, 0x800, 0x200, 0x1, 0xffffffffffffffff, 0x0, 0x0, 0x1, ...}, ...}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/server/routes.go:85 +0x308 fp=0x40004cefe0 sp=0x40004cee60 pc=0x922f28
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/server.ChatHandler(0x400058c400)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/server/routes.go:1175 +0x8fc fp=0x40004cf720 sp=0x40004cefe0 pc=0x92cbdc
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/gin-gonic/gin.(*Context).Next(...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func1(0x400058c400)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/server/routes.go:945 +0x78 fp=0x40004cf760 sp=0x40004cf720 pc=0x92b638
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/gin-gonic/gin.(*Context).Next(...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174

Not sure what the problem is though.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

@jhkuperus Which Jetson device do you have, and which Jetpack version are you running? I haven't tested it with JP6 yet. If you're not running JP6, you should clean your installation, as JP5 and below are not compatible with CUDA 12. Ensure you aren't installing any new video drivers or any new CUDA toolkit software, other than the 11.8 compatibility package as referenced on the Tegra support pages (it should work without this though).

In the meantime I'll run some tests and see if I messed up this last merge.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

@jhkuperus I executed a fresh build on my Jetson Orin Nano running JP5 and didn't run into the same segfault issue you had. I would guess your issue is either something present in JP6 that I haven't tested yet, or something like the JP5 + CUDA12 issue I referenced earlier.

@jhkuperus

Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 DP [L4T 36.2.0]

  • Module: NVIDIA Jetson AGX Orin (64GB ram)
    Libraries:
  • CUDA: 12.2.140
  • cuDNN: 8.9.4.25
  • TensorRT: 8.6.2.3
  • VPI: 3.0.10
  • Vulkan: 1.3.204
  • OpenCV: 4.8.0 - with CUDA: NO

I tested an earlier build from this PR a week or two ago. That one works just fine. Tell me if running it with some sort of verbose-flags or debugging will help you find out what the problem is. I'll gladly help.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

@jhkuperus Okay, so you are running JP6. It's really odd that an earlier build works when this doesn't, as the only substantial changes I've made are to sync it with upstream. When you say you pulled the newest version, did you clone it to a fresh folder or did you git pull on top of the existing folder? Try deleting the entire ollama folder and cloning it again.

Additionally, if you’re running it manually with ./ollama serve, all you need to do is export OLLAMA_DEBUG=1 to turn on verbose debugging.

Alternatively, JP6 is supposed to include libnvidia-ml.so support according to dustynv. I haven’t looked through it myself but if that’s true then the binary straight from Ollama may work.

@jhkuperus

I removed the entire repo, checked it out anew and ran the build again. It's working and I also know where my snag was. I ran go generate ./.. instead of go generate ./.... I now have a version that is also capable of running Gemma. Forget my segfault-notice earlier, the branch is still working fine under JP6 with Cuda12.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

@dhiltgen I think I may have figured out the issue with having to set specific architectures on Jetson devices.

It may be related to an upstream llama.cpp issue fixed here, and somewhat explained here:

There are a couple patches applied to the legacy GGML fork:
fixed __fp16 typedef in llama.h on ARM64 (use half with NVCC)
parsing of BOS/EOS tokens (see ggerganov/llama.cpp#1931)

Seems like the issue is related to fp16 typedef in llama.h for ARM64 platforms. Confirmed (somewhat) when I unset CMAKE_CUDA_ARCHITECTURES (thus compiling the default), and set "-DLLAMA_CUDA_F16=off". It compiled (though took quite a bit longer than it did previously); I think either turning off the f16 cuda option in the gen_linux.sh file or including the referenced patch from dustynv should shore this issue up and reduce overall changes. I'll defer to your judgement on this one.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

Update:

I've done more digging into the F16 issue. I'm not sure why my particular compiler is having this issue, but it would seem the crux of the problem is somewhat alluded to in a NVidia CUDA-8 Mixed Precision Guide:

CUFFT: FP16 computation requires a GPU with Compute Capability 5.3 or later (Maxwell architecture).

And on another GitHub issues thread, they also point out that this is because when a graphics card's SM version is less than 6.0, it doesn't support fp16. I'm not sure that's 100% accurate, but it would seem that sm < 6.0 has shaky fp16 support at best.

Also NVidia CUDA Programming Guide has some references:
The 32-bit __half2 floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher.
The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.

Additionally, when I tried to compile with LLAMA_CUDA_F16=on and CMAKE_CUDA_ARCHITECTURES="50;52;61;70;75;80", I receive this error:

/home/tegra/ok3d/ollama-container/dev/ollama/llm/llama.cpp/ggml-cuda.cu(6324): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(204): here
            function "__half::operator short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(222): here
            function "__half::operator unsigned short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(225): here
            function "__half::operator int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(228): here
            function "__half::operator unsigned int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(231): here
            function "__half::operator long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(234): here
            function "__half::operator unsigned long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(237): here
            function "__half::operator __nv_bool() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(241): here

This error goes away if I change CMAKE_CUDA_ARCHITECTURES="61;70;75;80" and everything works swimmingly.

@dhiltgen do you know why -DLLAMA_CUDA_FORCE_MMQ=on was enabled? I was under the impression that it was preferred not to force MMQ, which allows Tensor cores to be used; compile and GPU execution worked fine either way for me, though I haven't tested performance.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

It also seems like llama.cpp upstream changed the way they included __half support for ARM devices several times in the last few months:
Aug 2023 PR
Sep 2023 Commit
Jan 2024 Issue
Jan 2024 PR with Quote:

Jan 20, 2024: Thanks for the discussion - IMO the fundamental issue is that ggml_fp16_t is exposed through the public ggml API in the first place. It's something to fix in the future, but for now will merge this workaround

Feb 2024 Commit

@remy415
Contributor Author

remy415 commented Mar 6, 2024

This error goes away if I change CMAKE_CUDA_ARCHITECTURES="61;70;75;80" and everything works swimmingly.

Just to clarify: If CUDA Architecture is "50;51", setting LLAMA_CUDA_F16=off allows it to compile. CMAKE_CUDA_ARCHITECTURES="61;70;75;80" properly supports CUDA_F16.

@dhiltgen
Collaborator

dhiltgen commented Mar 6, 2024

do you know why -DLLAMA_CUDA_FORCE_MMQ=on was enabled?

This was needed to add support for older GPUs and based on the testing we did at the time, didn't seem to have a major performance impact for newer GPUs.

For Jetson support Compute Capability 5.0 support isn't relevant as far as I know, so this flag can be omitted.

@remy415
Contributor Author

remy415 commented Mar 6, 2024

Okay so I guess the takeaway is for ARM-based CUDA builds, leave architectures at default and disable f16 support, and it should be golden.

@remy415
Contributor Author

remy415 commented Mar 8, 2024

@dhiltgen I merged the latest Ollama release, which removed most of the AMD code from gpu.go, into this PR. I tested the build on my ARM and WSL+CUDA setups, and it looks like it's still good to go.

I also adjusted the memory overhead section to better align with the original code by setting overhead to 0 if the L4T env variable is detected. I know it would be preferred to leave in an overhead buffer, but the L4T OS automatic caching messes with the reported free memory (it is reported lower than it actually is) so sometimes when loading 7B models, the free memory reported is less than is needed by the model loader and it falls back to CPU. Maybe we can add a flag somewhere to just have the user manually disable overhead buffer assignment and leave it on by default? Not sure how to handle this.
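
One possible shape for such an opt-out, sketched with a hypothetical environment variable (OLLAMA_NO_VRAM_OVERHEAD is made up for illustration and is not a real Ollama setting):

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // OLLAMA_NO_VRAM_OVERHEAD is a made-up name used only for illustration:
        // if the user sets it, skip the overhead reservation entirely; otherwise
        // keep the default 1 GiB buffer.
        overhead := uint64(1 << 30)
        if os.Getenv("OLLAMA_NO_VRAM_OVERHEAD") != "" {
            overhead = 0
        }
        fmt.Println("reserved overhead bytes:", overhead)
    }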

Anyway, other than that the PR is ready for review with the latest release merged.

@davidtheITguy

davidtheITguy commented Mar 8, 2024 via email

Collaborator

@dhiltgen dhiltgen left a comment

Looking good!

I made a few minor changes to get Windows working, and to favor the cudart lib we carry as payload: remy415#1

Also needs a rebase.

Comment on lines 127 to 159
if (h.verbose) {
cudartBrandType_t brand = 0;
// When in verbose mode, report more information about
// the card we discover, but don't fail on error
// Need to map out alternatives of these for CUDART libraries
// For now just returning a generic "unsupported" error.
ret = CUDART_UNSUPPORTED;
if (ret != CUDART_SUCCESS) {
LOG(h.verbose, "nvmlDeviceGetName unsupported with CUDART libraries: %d\n", ret);
} else {
LOG(h.verbose, "[%d] CUDA device name: %s\n", i, buf);
}
if (ret != CUDART_SUCCESS) {
LOG(h.verbose, "nvmlDeviceGetBoardPartNumber unsupported with CUDART libraries: %d\n", ret);
} else {
LOG(h.verbose, "[%d] CUDA part number: %s\n", i, buf);
}
if (ret != CUDART_SUCCESS) {
LOG(h.verbose, "nvmlDeviceGetSerial unsupported with CUDART libraries: %d\n", ret);
} else {
LOG(h.verbose, "[%d] CUDA S/N: %s\n", i, buf);
}
if (ret != CUDART_SUCCESS) {
LOG(h.verbose, "nvmlDeviceGetVbiosVersion unsupported with CUDART libraries: %d\n", ret);
} else {
LOG(h.verbose, "[%d] CUDA vbios version: %s\n", i, buf);
}
if (ret != CUDART_SUCCESS) {
LOG(h.verbose, "nvmlDeviceGetBrand unsupported with CUDART libraries: %d\n", ret);
} else {
LOG(h.verbose, "[%d] CUDA brand: %d\n", i, brand);
}
}
Collaborator

Go ahead and remove this.

resp->err = strdup(buf);
return;
}
ret = (*h.cudaMemGetInfo)(&resp->free, &resp->total);
Collaborator

You should pass memInfo.free and total here so we can aggregate multi-GPU setups on line 164,165.

Comment on lines 30 to 33
typedef enum cudartBrandType_enum
{
CUDART_BRAND_UNKNOWN = 0,
} cudartBrandType_t;
Collaborator

It looks like this is unused and can be removed.

@remy415
Contributor Author

remy415 commented Mar 22, 2024

@dhiltgen all set, I think. Sorry, this is my first time working with a PR.
See below

@remy415
Contributor Author

remy415 commented Mar 22, 2024

@dhiltgen I see AssetsDir() referenced in gpu.go, but in the current ollama build there is no AssetsDir() defined in assets.go. Can you help me reconcile the differences between the current assets.go and the one referenced in your branch here?

@dhiltgen
Collaborator

Sorry about that. I should have left some comments to aid in the rebase.

Use PayloadsDir https://github.com/ollama/ollama/blob/main/gpu/assets.go#L21

We're reshuffling things more with #3218 so depending on when we get this vs. that PR merged there may be some more rebase adjustments necessary.

@remy415
Contributor Author

remy415 commented Mar 25, 2024

@dhiltgen I tested this on my Jetson and it compiled. Running a WSL test now. Should be good to go from here.

@remy415
Contributor Author

remy415 commented Mar 25, 2024

@dhiltgen Thank you for your support!

@buroa

buroa commented Mar 28, 2024

This is forcing everyone else to use cudart now, starting with 1.30.0. This shouldn't have been merged as-is.

@remy415
Contributor Author

remy415 commented Mar 28, 2024

@buroa The intent is to shift NVIDIA users to using the embedded cudart library and phase out libnvidia-ml.so completely once the transition is stable.

Unless you mean non-CUDA users are forced to use cudart?

@dhiltgen
Collaborator

@buroa cudart is bundled into the released binaries, and we should gracefully handle failures to dynamically load this library. Systems without nvidia GPUs shouldn't see a hard dependency on cudart. If you think we missed some corner case, please let us know more details.

@Czhazha

Czhazha commented Mar 30, 2024

Can Ollama support GPU operation on Jetson now? I followed the tutorial to run it on the Jetson Orin Nano, but it still shows 0% GPU utilization. I'm using JetPack 5.1.1.

@remy415
Contributor Author

remy415 commented Mar 30, 2024

@Czhazha yes but you need to compile it yourself for now, or after running ollama serve you need to manually delete the embedded CUDA libraries that were extracted in the /tmp folder /tmp/ollama####/runners, then execute ollama run <model>

Working on fixing this

@Czhazha

Czhazha commented Mar 30, 2024

@Czhazha yes but you need to compile it yourself for now, or after running ollama serve you need to manually delete the embedded CUDA libraries that were extracted in the /tmp folder /tmp/ollama####/runners, then execute ollama run <model>

Working on fixing this

Thank you. After following your process, I encountered a new error message when running it again. CUDA error: CUBLAS_STATUS_ALLOC_FAILED.

...
llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB
3月 30 23:22:57 nvidia ollama[7927]: llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 >
3月 30 23:22:57 nvidia ollama[7927]: llama_new_context_with_model: CUDA_Host output buffer size = 300.75 MiB
3月 30 23:22:58 nvidia ollama[7927]: llama_new_context_with_model: CUDA0 compute buffer size = 300.75 MiB
3月 30 23:22:58 nvidia ollama[7927]: llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
3月 30 23:22:58 nvidia ollama[7927]: llama_new_context_with_model: graph nodes = 868
3月 30 23:22:58 nvidia ollama[7927]: llama_new_context_with_model: graph splits = 2
3月 30 23:22:58 nvidia ollama[7927]: CUDA error: CUBLAS_STATUS_ALLOC_FAILED
3月 30 23:22:58 nvidia ollama[7927]: current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/g>
3月 30 23:22:58 nvidia ollama[7927]: cublasCreate_v2(&cublas_handles[device])
3月 30 23:22:58 nvidia ollama[7927]: loading library /tmp/ollama3449089709/runners/cuda_v11/libext_server.so
3月 30 23:22:58 nvidia ollama[7927]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:193: !"CUDA error"
3月 30 23:22:58 nvidia ollama[8189]: Could not attach to process. If your uid matches the uid of the target
3月 30 23:22:58 nvidia ollama[8189]: process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
3月 30 23:22:58 nvidia ollama[8189]: again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
3月 30 23:22:58 nvidia ollama[8189]: ptrace: Operation not permitted.
3月 30 23:22:58 nvidia ollama[8189]: No stack.
3月 30 23:22:58 nvidia ollama[8189]: The program is not being run.
3月 30 23:22:58 nvidia ollama[7927]: SIGABRT: abort
...

@remy415
Contributor Author

remy415 commented Mar 30, 2024

@Czhazha Okay, you're getting the same error we are. You may need to build it yourself until we figure out the CUDA build issue; please reference the build instructions in #3406.
