Add support for running llama.cpp with SYCL for Intel GPUs #2458
base: main
Conversation
It works now! I just forgot to add the
Is it possible to run ollama on Windows yet? I only tested this on Linux, but if it's possible to run on Windows I could make sure it works there as well.
I saw #403 (comment) but I haven't tried it yet.
@felipeagc do you have a build I can give a try? I tried building it myself, but the oneAPI basekit is 12 GB and I don't have that much space on my laptop.
A related question is, do you know how the performance compares to Vulkan? Maybe you can also take a look here: #2396
@ddpasa Since I'm not embedding the oneAPI runtime libraries into ollama, you're going to need to install the basekit, unfortunately. I see that in the

I have not tested Vulkan yet, but I suspect it's going to be slower. Will report back on this after testing.
@Leo512bit great, I'll give it a try.
These are the oneAPI libraries we would need to bundle with ollama:
Would this be considered too big? I also saw this comment in
A few updates: I tried getting this to work on Windows, but no success yet. I got ollama to build and link against the oneAPI libraries, but I'm still having problems with llama.cpp not seeing the GPU. Running the

I also tried to run it on WSL2, but I'm getting a segfault in Intel's Level Zero, which is the library I use to query information about the GPU. Intel says WSL2 is supported, so I'll have to look into this a bit more.
Can you please write down build instructions for Ubuntu? I'll help you with some feedback and benchmarks.
@chsasank Sure:
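(The exact steps from the original comment aren't preserved here; what follows is a minimal sketch of the build as I understand it, assuming the oneAPI base toolkit is installed under the default /opt/intel/oneapi prefix. The fork URL and branch name are assumptions.)

```bash
# Load the oneAPI environment (provides the SYCL compiler and oneMKL)
source /opt/intel/oneapi/setvars.sh

# Fetch the PR branch (fork URL and branch name are assumptions)
git clone https://github.com/felipeagc/ollama.git
cd ollama
git checkout sycl

# ollama's usual build flow: generate the llama.cpp build, then build the Go binary
go generate ./...
go build .
```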
I'm not even sure if it's going to work on Ubuntu yet; I've only tried it on Arch Linux. I tried running it on Ubuntu under WSL2, but sadly I found out that my A750 does not support virtualization. Anyway, please tell me if there is any problem :)

As for benchmarks, this is my first time running LLMs locally, so I have no point of reference. I'm getting about 6 tokens/sec on my CPU (Ryzen 5 5600G) and about 20 tokens/sec on my GPU (Intel Arc A750 8GB) running llama2 7b. I haven't measured exact numbers, but interestingly my MacBook Air M1 16GB has very similar speed to the A750. I'm not sure that should be the case; I'd expect the dedicated GPU to be faster than a laptop.

EDIT: measured the speed on the MacBook Air M1 and it's doing around 13 tokens/sec on the same models.
I have an Arc 770 card and I use the oneAPI samples to do some benchmarks. Follow the last steps of this tutorial (https://chsasank.com/intel-arc-gpu-driver-oneapi-installation.html) to benchmark fp16 matrix multiplication. Meanwhile, I'll build for the Arc 770 and come back with some results.
Here are benchmarks on my Arc 770 16 GB for reference:
On the M2, matmul TFLOPS is around 1 or 2. Check this: https://gist.github.com/chsasank/407df67ac0c848d6259f0340887648a9 I will also replicate the above using Intel's PyTorch extensions.
@chsasank It would be cool if you could benchmark llama.cpp against https://github.com/intel-analytics/BigDL from Intel, to see if there's an advantage to using their first-party solution.
Making a list of benchmark comparisons:
Lemme know if I should add anything else. Meanwhile, can you also reproduce matrix_mul_mkl in your Arc 750 dev env?
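(For anyone reproducing: matrix_mul_mkl lives in Intel's oneAPI-samples repo. A rough sketch, assuming the oneAPI environment is already loaded; the path and Makefile-based build are assumptions from the samples repo layout.)

```bash
git clone https://github.com/oneapi-src/oneAPI-samples.git
cd oneAPI-samples/Libraries/oneMKL/matrix_mul_mkl

# Build with the oneAPI compiler and run the matmul benchmark
make
./matrix_mul_mkl
```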
I have done benchmarks of mistral 7b int4 on an M2 Air, an Intel 12400, and an Arc 770 16GB. I used llama-bench and the mistral 7b model from here to measure prompt processing and text generation tok/s. On the M2 Air:
On the Intel 12400 (compiled with SYCL, but with num-gpu-layers (ngl) set to 0):
On the Arc 770:
I compiled llama.cpp with the commit in this PR. The good news is that prompt processing speed is reasonably high. The bad news is that text generation on Arc GPUs is very slow. I will do further analysis and create an issue on the llama.cpp repo.
Would this bundle something that would work on my laptop without needing to install oneAPI? If so, I'm eager to try this out.
@chsasank Here are the results from my A750 on the same model you tested:
(this is with F16 turned on)
@ddpasa Yes, but I haven't configured bundling of the libraries yet. I'll try doing this today. Out of curiosity, which GPU do you have on your laptop?
It's an Iris Plus G7. It works really well with ncnn, so I'm hoping for a similar experience.
@ddpasa I couldn't get the oneAPI libraries to work when bundled with ollama, so I think your best bet is just to install the base toolkit, unfortunately.
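(If you go the base-toolkit route, the runtime libraries are picked up by loading the oneAPI environment in the same shell before launching ollama; a minimal sketch assuming the default install prefix.)

```bash
# Must be sourced in the same shell that launches ollama
source /opt/intel/oneapi/setvars.sh
./ollama serve
```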
Update: I added support for building oneAPI-enabled Docker images. @chsasank @ddpasa I also tested my A750 with llama.cpp's Vulkan backend, and the results are interesting:
Both are faster than the SYCL version, and Windows is slightly faster.
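(Regarding the Docker images mentioned above: a sketch of how such an image might be built and run. The Dockerfile name and image tag are assumptions, but /dev/dri passthrough is the standard way to expose an Intel GPU to a container.)

```bash
# Dockerfile name and tag are hypothetical
docker build -t ollama-sycl -f Dockerfile.oneapi .

# Expose the Intel GPU device nodes to the container
docker run --rm -it --device /dev/dri -p 11434:11434 ollama-sycl
```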
Vulkan results are interesting! Did you follow the instructions from here? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#vulkan I will reproduce the results with llama-bench. By the way, I created an issue about performance at ggerganov/llama.cpp#5480. I think we need a performant baseline that utilizes the GPU well.
@chsasank Yes, and I tried running llama-bench with Vulkan but got really bad results (around 3 tok/s), with the last run not even finishing, which is strange. But running the
Indeed, my initial guess was that the current best-performing solution would be BigDL-LLM, simply because it's made by Intel. It's a pain to install, but I got it working a couple of days ago, and the performance is not all that different from llama.cpp. I did not make any precise measurements though (and I'm too lazy to go through their setup again haha). If you want to give it a try, it might give us more insight into this.
I observed the last run not finishing in other tests as well. But you're right, I'm getting very slow tok/s in llama-bench too. Makes you wonder if llama-bench is accurate!

Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
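(For comparability, a sketch of the kind of llama-bench invocation behind these numbers; the model path and parameter values are assumptions.)

```bash
# -ngl 99 offloads all layers to the GPU; -p/-n set the prompt-processing
# and text-generation lengths that llama-bench reports as pp/tg tok/s
./llama-bench -m models/mistral-7b-q4_0.gguf -ngl 99 -p 512 -n 128
```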
I too tried installing BigDL, and indeed it's a bit of a pain. Besides, the examples in the repo are neither straightforward nor self-contained. I don't think the assumption that first-party repos have good performance is really accurate right now.

So far I have seen that Intel's PyTorch extensions (IPEX) are pretty performant. I have done some benchmarks and found that matmul FLOPS match oneMKL, because PyTorch is essentially a wrapper over it:

oneMKL:
PyTorch
LLM inference is actually pretty straightforward; see llama2.c and vanilla-llama. Maybe it's worth hacking vanilla-llama to work with Intel GPUs, and that can be our baseline. I am also working on a pure oneAPI-based backend for LLM inference, but paused it for a bit because llama.cpp got SYCL support. I guess I may have to get back to it again.
@chsasank Very interesting. I'm actually pretty new to this, so I'll look at llama2.c for sure. You should definitely work on the pure oneAPI version, that would be a great project!
I followed the instructions (#2458 (comment)) and it's not working for me.
image: https://github.com/ollama/ollama/assets/64481039/df9fd925-bcdc-443c-884e-a0690af7c69e
I think you need to install the oneAPI base toolkit (or whatever it's called).
I do have it installed
image: https://github.com/ollama/ollama/assets/64481039/2ef0f24f-3220-40cb-a554-d162f66f3b7b
Sorry then, I haven't tried compiling this stuff yet, so I don't know what it might be.
It's not finding the Level Zero library, which is part of Intel's driver. It should already be installed, so maybe your Linux distro puts it somewhere else. Can you locate where libze_intel_gpu.so is on your machine?
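(Something like this should turn it up wherever the distro puts it:)

```bash
# Check the dynamic linker cache first
ldconfig -p | grep libze_intel_gpu

# Fall back to a filesystem search
find /usr /lib* -name 'libze_intel_gpu.so*' 2>/dev/null
```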
Turns out it's provided by
Really? I thought Intel Arc supported SR-IOV. Did you enable it in UEFI? I do have an A770 16GB, so maybe only the big one supports it? (I don't know, I haven't tried passthrough on Arc yet.) Anyway, I tried compiling on WSL2 but I got this mess. Why was it looking in my VMware install?
EDIT: Please see my comment below; I was able to get past this on 22.04.

I'm testing this on a fresh install of Kubuntu 23.10. My GPU is an Arc A770 16 GB. I installed the Intel oneAPI base toolkit, and have the following go, cmake, and gcc versions:
I was able to build with

This is the output when trying to serve; it segfaults:
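(Not from the original comment, but a quick way to get more out of a segfault like this is to run the server under gdb and grab a backtrace:)

```bash
# Run ollama under gdb and print a backtrace when it crashes
gdb -ex run -ex bt --args ./ollama serve
```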
I confirmed that the oneAPI environment can be loaded manually:
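(The output isn't preserved above; the usual check is to source the environment and list the SYCL devices:)

```bash
source /opt/intel/oneapi/setvars.sh
# The Arc GPU should appear as both an OpenCL and a Level Zero device
sycl-ls
```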
I also tested this under Ubuntu 22.04 in WSL earlier, which surprisingly enough had the same result -- a segfault when trying to serve.
EDIT: I missed something silly -- I didn't run

I came to the conclusion that the segfault was related to drivers, and I've since installed Kubuntu 22.04, since 22.04 is what Intel seems to have validated everything on. Doing so helped me get farther along, but I'm still running into issues. To detail my setup process:
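(The detailed steps aren't preserved above. For reference, the usual Intel GPU compute setup on Ubuntu 22.04 looks roughly like the following; the repository URL and package names come from Intel's published docs and may have changed since.)

```bash
# Add Intel's graphics repository (per Intel's client GPU install docs)
wget -qO- https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --dearmor -o /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt update

# OpenCL ICD and the Level Zero compute runtime
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero
```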
EDIT: Be sure to initialize your oneAPI environment with Finally
Just to sanity check, I tested PyTorch per https://intel.github.io/intel-extension-for-pytorch/index.html#installation and the GPU is detected:
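(The detection output isn't preserved; the usual IPEX sanity check is a one-liner along these lines:)

```bash
# Importing intel_extension_for_pytorch registers the xpu backend with torch
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available(), torch.xpu.get_device_name(0))"
```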
I also added my user to the
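(The group name is cut off above; on most distros GPU compute access requires the render group, sometimes also video -- a guess at the intended command:)

```bash
# render (and sometimes video) group membership is needed for /dev/dri access
sudo usermod -aG render,video $USER
# Log out and back in for the new group to take effect
```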
This PR already has conflicts :/ What's holding it back from getting merged?
Hey everyone, I'm currently in the process of moving, so I don't have access to my PC with an Intel Arc and won't for a little while. If anyone wants to take over this PR, please feel free.
Can this PR generate output correctly on Intel Arc (Ubuntu)? I got garbled output like this: ollama run example "What is your favourite condiment?"
!##"##! "!▅
▅
"! $ #"# ## ▅"#! |
Hi @chsasank, can you run ollama normally on Ubuntu with an Intel Arc graphics card?
Works just fine for me...
I see... you built a Docker image? By the way, did you run it on Ubuntu?
No, I built it on Arch Linux as a normal binary.
Just one additional thought: does SYCL only work on Intel GPUs, or does it also work on AMD GPUs? ROCm sometimes has odd quirks on Windows (such as GFX version mismatches) that prevent it from working properly, and it would be good to see if OpenCL can be used with AMD GPUs, which has much better support. This would require SYCL support on native Windows, though, since neither ROCm nor the Mesa OpenCL stack seems to support calling an AMD GPU inside WSL2, whether or not virtualization is enabled.
Hi @chsasank and @felipeagc, thank you for sharing your concerns about installing BigDL-LLM :) I'm on the development team, and we'd love to help out. Could you share more about the installation problems you're facing? We have also updated our installation guide for Intel GPUs recently, and added detailed quickstart guides (covering installation, benchmarking, etc.) that might help. Please feel free to review them at your convenience and share any thoughts or feedback you might have. Thank you!
Hi @felipeagc, thank you for making it possible for this outstanding ollama project to run on Intel GPUs. Let's work together to push this PR forward. I have rebased it onto the latest ollama main branch and verified that it works well on Ubuntu 22.04 with an Arc 770 GPU. I also created a pr2; once it gets merged (or once a new PR based on pr2 is opened against ollama), we can discuss with ollama's maintainers how to proceed with merging PR #2458 into the ollama main branch. I'm new to this project and unfamiliar with many aspects, so I appreciate any guidance from the community. Thank you!
Hello,

Logs:
AFAIK this dGPU supports oneAPI/SYCL and should work. I'm happy to test this project for you on my dGPU, but I am not familiar enough with this project, GPU programming, etc., to take it over.

EDIT: The dGPUs are in the container:

EDIT1: From using
This is my attempt at adding SYCL support to ollama.
It's not working yet, and there are still some parts marked as TODO. If anyone wants to take a crack at finishing this PR, I'm currently stuck on this error:

It's probably due to the way ollama builds the C++ parts and Intel's compiler not expecting it to be done in this way. The kernels are probably getting eliminated from the binary in some build step.

I'm not sure when I'm going to have more time to work on this PR, so I'll just leave it here as a draft for now.

EDIT: it works now :)