Add Vulkan support to ollama #5059
base: main
Conversation
Are there any available instructions or guides that outline the steps to install Ollama from its source code on a Windows operating system? I have a Windows 10 machine equipped with an Arc A770 GPU with 8GB of memory.
https://github.com/ollama/ollama/blob/main/docs/development.md
I compiled and ran this on Linux (Arch, with an Intel iGPU). It seems to work correctly, with performance and output similar to my hacky version in #2578. I think we can abandon my version in favour of this (it was never meant to be merged anyway).
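For reference, a rough sketch of the generic source build flow from the development docs linked above, as they stood around the time of this PR; the exact steps, and any Vulkan-specific flags this PR introduces, may differ, so treat this as an assumption rather than official instructions:

go generate ./...   # runs the generate scripts that build the vendored llama.cpp runners
go build .          # builds the ollama binary itself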
gpu/gpu.go (outdated)

	index: i,
}

C.vk_check_vram(*vHandles.vulkan, C.int(i), &memInfo)
It could be nice to have a debug log here printing the amount of memory detected (especially with iGPUs this number can be useful).
Doesn't ollama do that already? When I was debugging I saw something like:
Jun 15 20:25:32 rofl strace[403896]: time=2024-06-15T20:25:32.702+08:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Arc(tm) A770 Graphics (DG2)" total="15.9 GiB" available="14.3 GiB"
I think you're right, but I don't see that exact line. Looks like a CAP_PERFMON thing, or I messed up the compilation:
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Plus Graphics (ICL GT2) | uma: 1 | fp16: 1 | warp size: 32
llama_new_context_with_model: Vulkan_Host output buffer size = 0.14 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 234.06 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
Vulkan.time=2024-06-16T13:20:27.582+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib64/libvulkan.so.1.3.279 /usr/lib64/libcap.so.2.69=error !BADKEY="performance monitoring is not allowed. Please enable CAP_PERFMON or run as root to use Vulkan."
nvtop reveals the iGPU being used as expected.
Maybe run ollama as root? Or do setcap cap_perfmon=+ep /path/to/ollama
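In case it helps, a minimal sketch of granting and then verifying the capability on the binary (assuming ollama is installed at /usr/local/bin/ollama; adjust the path to your install):

sudo setcap cap_perfmon=+ep /usr/local/bin/ollama   # grant CAP_PERFMON to the binary
getcap /usr/local/bin/ollama                        # verify; should report cap_perfmon=ep (exact formatting varies by libcap version)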
Thanks. setcap didn't work for some reason, I still get CAP_PERFMON errors. But running with sudo gives:
time=2024-06-16T13:52:35.115+02:00 level=INFO source=gpu.go:355 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=oneapi compute="" driver=0.0 name="Intel(R) Iris(R) Plus Graphics" total="0 B" available="0 B"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Iris(R) Plus Graphics (ICL GT2)" total="11.4 GiB" available="8.4 GiB"
Vulkan is reporting that the device is a CPU. If it's an iGPU it should've been detected.
You mentioned the performance was similar to when you were testing your branch. Are you sure you are not using CPU inference the entire time? Can you compare the performance against a CPU runner like cpu_avx?
On a second read, never mind. It seems like everything is working as expected. Ollama detected two Vulkan devices: one is a CPU software implementation, which is skipped according to the error message, and the last line reports a Vulkan device that is recognized by ollama, which is the actual iGPU.
Yes, that looks right. There is also a lot of oneAPI junk in the logs that confuses me, but it looks like Vulkan works as intended; I just have a CAP_PERFMON problem.
nvtop screenshot below:
I wonder why setcap does not work... Could it be that one of the shared libraries (like libcap or libvulkan) needs setcap instead of the ollama binary?
OK, the CAP_PERFMON issue is likely due to something off in my system. It's trying to load the 32-bit library for some reason:
time=2024-06-16T14:58:06.326+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib/libvulkan.so.1.3.279 /usr/lib32/libcap.so.2.69=error !BADKEY="Unable to load /usr/lib32/libcap.so.2.69 library to query for Vulkan GPUs: /usr/lib32/libcap.so.2.69: wrong ELF class: ELFCLASS32"
Loading the 32-bit library is expected; it's not related, because it'll just skip it when it realizes it can't load it.
gpu/gpu_linux.go (outdated)

}

var capLinuxGlobs = []string{
	"/usr/lib/x86_64-linux-gnu/libcap.so*",
Adding * after /usr/lib also detects 32-bit libraries on the system. Not sure if you want this.
I suppose this depends on the OS? On Fedora I need to specify only lib64 for this to work, as lib is 32-bit.
Same comment as above regarding x86_64-specific usage: this doesn't work on aarch64, like my Raspberry Pi :)
@dhiltgen mind reviewing this? I'd imagine this would be pretty useful for plenty of people. Intel Arc GPUs perform faster with Vulkan than with oneAPI, and oneAPI is still not packaged on NixOS. Someone has also mailed me noting how Vulkan support has let them run ollama on Polaris GPUs much faster.
Working well here on an RX 5700 (the notorious gfx1010). Hoping I can use ROCm again with 6.2, but this is a great alternative.
I managed to run this on Windows with an AMD GPU; if it works out I'll share the way I tried.
Interesting. I had expected that, since I hadn't implemented Vulkan library loading on Windows, it wouldn't have detected any Vulkan devices. Please do share how you did it.
I'll add the corresponding code, but I'm not that familiar with Vulkan and this may take time.
Not just Arc; it also gives nice speedups on Intel iGPUs (the Iris series).
It works perfectly on Arch Linux with my RX 6700 XT as well, which doesn't have official ROCm support. I did encounter a couple of hiccups while setting it up, though they're probably distro-specific issues with my Arch Linux installation. I'll post the changes I made just for the record.
otherwise Ollama wouldn't compile with Vulkan support.
@utherbone, try running
Hmm, for some reason it doesn't work for me (Manjaro Linux, Kernel 6.12, 5700 XT, drivers 24.20, qwen2.5-coder:7b). Ollama detects my GPU, but during model execution it's not being used at all. The --n-gpu-layers (num_gpu) parameter is passed to the runner, but it seems to be ignored for some reason... At the same time, LM Studio works perfectly and fully utilizes the GPU, achieving 35 tokens/sec.
Does anyone have a hint on how to build Ollama (based on v0.5.7) with this PR, or more specifically with pufferffish:vulkan? I don't see any Makefile rules for vulkan, so the only thing I was able to try so far was building with
And the result: I don't see any Vulkan-specific runners:
Probably a dumb question, but how do you build this on Windows? I keep getting
FYI, issues are disabled in the fork, so where is the best place to raise an issue?
My bad, I've enabled it.
Sync vendored ggml to add Vulkan support
Vulkan support is important! I was able to run ollama + Vulkan locally with a ~21% improvement over CPU inference on an AMD iGPU (8700G vs 780M).
@jmorganca Ollama is plenty fast on Nvidia GPUs, and optimising the speed can wait. Adding support for platforms that would enable a massive segment of the market (iGPU users) should take priority over gaining higher speeds on platforms that already work well enough. Regarding testing: if main has support for iGPUs (the 890M, for example), I'm sure there are enough members of the community who are more than willing to help with testing (I know I am, if it helps me get Ollama using my GPU for inference).
Hello everyone! It seems like there isn't an easily accessible Docker image for Ollama with Vulkan support, or at least it's hard to find one. So I decided to create one using @whyvl's fork along with some patches shared in his fork's discussions. If you're looking for a straightforward way to run Ollama with Vulkan support, you can use the following Docker command:

docker run -v ~/.ollama:/root/.ollama --name ollama --device /dev/dri:/dev/dri --cap-add PERFMON -p 11434:11434 ahmedsaed26/ollama-vulkan

Or, if you prefer Docker Compose, use this configuration:

services:
  ollama:
    image: ahmedsaed26/ollama-vulkan
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ~/.ollama:/root/.ollama
    devices:
      - /dev/dri:/dev/dri
    cap_add:
      - PERFMON

Then, start the container with:

docker compose up -d

Currently, this image includes Ollama v0.5.11, and I have only tested it on an AMD Radeon RX 470 GPU on Linux. Hope this helps! Let me know if you run into any issues. 🚀
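A quick way to confirm the container actually picked up a Vulkan device is to check the startup logs for the "inference compute ... library=vulkan" line seen earlier in this thread (a sketch, assuming the container name "ollama" used above):

docker logs ollama 2>&1 | grep -i "library=vulkan"
# expect something like: msg="inference compute" id=0 library=vulkan ... total="..." available="..."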
Hi, I am new to this issue. Is there a quick-start guide so I can try Ollama with the Vulkan backend? I am running Ubuntu. I did some speed tests (https://github.com/eliranwong/AMD_iGPU_AI_Setup/tree/main#speed-tests) comparing the performance of llama.cpp with the Vulkan and ROCm backends against Ollama, and in view of the results Ollama is not working very well, so I would like to try Ollama with Vulkan. Seeing the long thread here, I'm wondering if someone could give me some hints on how to compile with Vulkan support...
The author of this PR has been weirdly quiet about providing build instructions. Not here, not in a GH issue in the forked repository (whyvl#7), while actively pushing code updates and merging PRs. I honestly have no idea what's going on. But I would recommend going through the steps other people provided in that issue instead of waiting for an "official" response.
For those who want to test Ollama with Vulkan without building it, or who have build-related issues, you can find ready-to-use binaries in these (@eliranwong, @rwalle, etc.). I personally tested both and they work well for me.
May I ask for build instructions instead? I encountered errors when running your binary:
I would like to compile on my side, thanks.
Sorry to hear that. I have not tried to build it; the best thing I can suggest is to read that issue (whyvl#7) and extrapolate the correct build instructions. I wanted to work on a "clean" fork with clean build instructions, but I did not have the time.
Please have a look at the Dockerfile contents I have posted here and follow most of the steps there. Additionally, the patch I'm using there is this one: whyvl#7 (comment)
Hello, I know I am a bit late to the party, but I would just like to warn that those binaries were made by a third party, not the project maintainer. I did some scans with VirusTotal and locally with ClamAV and maldet, and they seem okay. But I just wanted to remind people that installing/running random binaries is not ideal. I would advise a CI/CD pipeline. Many thanks,
Absolutely agree. Always feel free to build it yourself.
Edit: (2025/01/19)
It's been around 7 months and the ollama devs don't seem to be interested in merging this PR. I'll maintain this fork as a separate project from now on. If you have any issues, please raise them in the fork's repo so I can keep track of them.
This PR adds Vulkan support to ollama with a proper memory monitoring implementation. This closes #2033 and replaces #2578, which does not implement proper memory monitoring.
Note that this implementation does not support GPUs without VkPhysicalDeviceMemoryBudgetPropertiesEXT support. This shouldn't be a problem since, on Linux, the Mesa driver supports it for all Intel devices afaik.

The CAP_PERFMON capability is also needed for memory monitoring. This can be granted by specifically enabling CAP_PERFMON when running ollama as a systemd service (adding AmbientCapabilities=CAP_PERFMON to the service), or by just running ollama as root.

Vulkan devices that are CPUs under the hood (e.g. llvmpipe) are also not supported. This is purposely done to avoid accidentally using CPUs for accelerated inference. Let me know if you think this behavior should be changed.
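For example, a minimal sketch of the systemd approach (assuming the unit is named ollama.service; the unit name may differ on your system):

sudo systemctl edit ollama.service
# in the drop-in file that opens, add:
#   [Service]
#   AmbientCapabilities=CAP_PERFMON
sudo systemctl restart ollama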
I've not tested this on Windows, nor have I implemented the logic for building ollama with Vulkan support there yet, because I don't use Windows. If someone can help me with this, that would be great.
I've tested this on my machine with an Intel Arc A770: