Add support for running llama.cpp with SYCL for Intel GPUs #2458

Open
wants to merge 8 commits into main

Conversation

felipeagc
Copy link

@felipeagc felipeagc commented Feb 12, 2024

This is my attempt at adding SYCL support to ollama. It's not working yet, and there are still some parts marked as TODO.

If anyone wants to take a crack at finishing this PR, I'm currently stuck on this error:

No kernel named _ZTSZZL17rms_norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_ was found -46 (PI_ERROR_INVALID_KERNEL_NAME)Exception caught at file:/home/felipe/Code/go/ollama/llm/llama.cpp/ggml-sycl.cpp, line:12708

It's probably due to the way ollama builds the C++ parts, which Intel's compiler doesn't expect; the kernels are probably getting stripped from the binary in some build step.

I'm not sure when I'm going to have more time to work on this PR, so I'll just leave it here as a draft for now.

EDIT: it works now :)

@felipeagc
Copy link
Author

It works now! I just forgot to add the -fsycl compiler flag. I also made it so you don't need to set up the oneAPI environment variables yourself: at build time the gen_linux.sh script does it for you, and at runtime it uses rpath to find the libraries.
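
Roughly, the build-time part boils down to something like this (a simplified sketch, not the literal gen_linux.sh contents; the cmake options are the ones documented for llama.cpp's SYCL backend, and the rpath value is an assumption):

# source the oneAPI environment, build llama.cpp with the SYCL backend
# (which compiles with icx/icpx and -fsycl), and bake an rpath so the
# oneAPI runtime libraries are found without sourcing setvars.sh at runtime
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
      -DCMAKE_BUILD_RPATH=/opt/intel/oneapi/compiler/latest/lib
cmake --build build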

@felipeagc felipeagc marked this pull request as ready for review February 12, 2024 05:49
@felipeagc
Copy link
Author

Is it possible to run ollama on Windows yet? I only tested this on Linux, but if it's possible to run on Windows I could make sure it works there as well.

@Leo512bit
Copy link

Leo512bit commented Feb 12, 2024

I saw #403 (comment) but I haven't tried it yet.

@ddpasa
Copy link

ddpasa commented Feb 12, 2024

It works now! I just forgot to add the -fsycl compiler flag. I also made it so you don't need to set up the oneAPI environment variables yourself: at build time the gen_linux.sh script does it for you, and at runtime it uses rpath to find the libraries.

@felipeagc do you have a build I can give a try? I tried building it, but the oneAPI Base Toolkit is 12 GB and I don't have that much space on my laptop.

@ddpasa
Copy link

ddpasa commented Feb 12, 2024

A related question: do you know how the performance compares to Vulkan? Maybe you can also take a look here: #2396

@felipeagc
Copy link
Author

@ddpasa Since I'm not embedding the oneAPI runtime libraries into ollama, you're unfortunately going to need to install the Base Toolkit. I see that in the gen_linux.sh script the CUDA libraries are shipped with ollama, so it should be possible to do the same here; we would just need to check the licensing restrictions and file size of the oneAPI libraries to see if it's viable, since the ROCm ones were not shipped due to file size.

I have not tested Vulkan yet, but I suspect it's going to be slower. Will report back on this later after testing though.

@felipeagc
Copy link
Author

I saw #403 (comment) but I haven't tried it yet.

@Leo512bit great, I'll give it a try.

@felipeagc
Copy link
Author

felipeagc commented Feb 12, 2024

These are the oneAPI libraries we would need to bundle with ollama:

Library Size
libOpenCL.so 0.06M
libmkl_core.so 68M
libmkl_sycl_blas.so 97M
libmkl_intel_ilp64.so 20M
libmkl_tbb_thread.so 31M
libtbb.so 3.7M
libsvml.so 26M
libirng.so 1.1M
libintlc.so 0.39M
libsycl.so 4.2M
libimf.so 4.4M
Total 255.85M
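
For reference, a list like this can be gathered by checking what the SYCL-enabled library actually links against (the build output path below is an assumption; adjust it to wherever your build puts libext_server.so):

# list the shared libraries the SYCL build resolves from the oneAPI install
ldd ./llm/build/linux/x86_64/oneapi/lib/libext_server.so | grep /opt/intel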

Would this be considered too big?

I also saw this comment in gen_linux.sh regarding the CUDA libraries:

# Carry the CUDA libs as payloads to help reduce dependency burden on users
#
# TODO - in the future we may shift to packaging these separately and conditionally
#        downloading them in the install script.

@felipeagc
Copy link
Author

A few updates: I tried getting this to work on Windows, but no success yet. I got ollama to build and link against the oneAPI libraries, but I'm still having problems with llama.cpp not seeing the GPU. Running the main example with SYCL enabled from the llama.cpp repository "works", but I get no output, which is strange.

I also tried to run it on WSL2, but I'm getting a segfault in Intel's Level Zero, which is the library I used to query information about the GPU. Intel says WSL2 is supported, so I'll have to look into this a bit more.

@chsasank
Copy link

Can you please write down build instructions on Ubuntu? I'll help you with some feedback and benchmarks.

@felipeagc
Copy link
Author

felipeagc commented Feb 13, 2024

Can you please write down build instructions on Ubuntu? I'll help you with some feedback and benchmarks.

@chsasank Sure:

  1. Install the oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2024-0/install-with-command-line.html (be sure to install as root to /opt/intel/oneapi, or install using apt; there's also a section for that on the website)
  2. Add yourself to the video and render groups: sudo usermod <username> -aG video and sudo usermod <username> -aG render (be sure to log out and back in for this to take effect)
  3. Install cmake and make
  4. Build ollama:
git clone https://github.com/felipeagc/ollama.git
cd ollama
go generate ./...
go build .
  5. That's it! (A quick sanity check is sketched right after this list.)
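
A quick way to verify the setup before running ollama (sycl-ls ships with the oneAPI Base Toolkit):

source /opt/intel/oneapi/setvars.sh
sycl-ls                # the Arc GPU should show up as a Level Zero / OpenCL device
./ollama serve         # terminal 1
./ollama run llama2    # terminal 2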

I'm not even sure if it's going to work on Ubuntu yet; I only tried it on Arch Linux. I tried running Ubuntu on WSL2, but sadly I found out that my A750 does not support virtualization. Anyway, please tell me if there is any problem :)

As for benchmarks, this is my first time running LLMs locally, so I have no point of reference. I'm getting about 6 tokens/sec on my CPU (Ryzen 5 5600G) and about 20 tokens/sec on my GPU (Intel Arc A750 8GB) running llama2 7b. I haven't measured exact numbers, but interestingly my MacBook Air M1 16GB is very similar in speed to the A750. I'm not sure that should be the case; I'd expect the dedicated GPU to be faster than a laptop.

EDIT: I measured the speed on the MacBook Air M1 and it's doing around 13 tokens/sec on the same model.

@chsasank
Copy link

chsasank commented Feb 13, 2024

I have an Arc 770 card and I use the oneAPI samples for benchmarks. Follow the last steps of this tutorial (https://chsasank.com/intel-arc-gpu-driver-oneapi-installation.html) to benchmark fp16 matrix multiplication. Meanwhile, I'll build for the Arc 770 and come back with some results.

@chsasank
Copy link

Here are benchmarks on my Arc 770 16 GB for reference:

(base) sasank@arc-reactor:~/oneAPI-samples/Libraries/oneMKL/matrix_mul_mkl$ ./matrix_mul_mkl half 4096
oneMKL DPC++ GEMM benchmark
---------------------------
Device:                  Intel(R) Arc(TM) A770 Graphics
Core/EU count:           512
Maximum clock frequency: 2400 MHz

Benchmarking (4096 x 4096) x (4096 x 4096) matrix multiplication, half precision
 -> Initializing data...
 -> Warmup...
 -> Timing...

Average performance: 58.7353TF
(base) sasank@arc-reactor:~/oneAPI-samples/Libraries/oneMKL/matrix_mul_mkl$ ./matrix_mul_mkl single 4096
oneMKL DPC++ GEMM benchmark
---------------------------
Device:                  Intel(R) Arc(TM) A770 Graphics
Core/EU count:           512
Maximum clock frequency: 2400 MHz

Benchmarking (4096 x 4096) x (4096 x 4096) matrix multiplication, single precision
 -> Initializing data...
 -> Warmup...
 -> Timing...

Average performance: 16.4633TF

On the M2, matmul TFLOPS is around 1 or 2. Check this: https://gist.github.com/chsasank/407df67ac0c848d6259f0340887648a9

I will also replicate the above using the Intel PyTorch Extensions.

@felipeagc
Copy link
Author

@chsasank It would be cool if you could benchmark llama.cpp against https://github.com/intel-analytics/BigDL from Intel to see if there's an advantage to using their first-party solution.

@chsasank
Copy link

chsasank commented Feb 13, 2024

Making a list of benchmark comparisons:

  • oneMKL TFLOPS
  • PyTorch TFLOPS
  • llama.cpp mistral-7b int8 tok/s
  • BigDL mistral-7b int8 tok/s

Lemme know if I should add anything else. Meanwhile, can you also reproduce matrix_mul_mkl on your Arc 750 dev env?

@chsasank
Copy link

I have benchmarked Mistral 7B int4 on the M2 Air, Intel 12400, and Arc 770 16GB. I used llama-bench and the Mistral 7B model from here to measure prompt-processing and text-generation tok/s.
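
For anyone reproducing these numbers: they come from llama-bench invocations roughly like the following (the model path is a placeholder; -p/-n set the prompt-processing and text-generation lengths, -ngl the number of layers offloaded to the GPU):

# GPU run: all layers offloaded
./llama-bench -m ./models/model-q4_0.gguf -ngl 99 -p 128,256,512 -n 128,256,512
# CPU comparison: same build, no layers offloaded
./llama-bench -m ./models/model-q4_0.gguf -ngl 0 -p 128,256,512 -n 128,256,512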

On M2 Air

model size params backend ngl test t/s
llama 7B Q4_0 3.83 GiB 7.24 B Metal 99 pp 128 144.47 ± 0.22
llama 7B Q4_0 3.83 GiB 7.24 B Metal 99 pp 256 142.95 ± 1.17
llama 7B Q4_0 3.83 GiB 7.24 B Metal 99 pp 512 141.36 ± 0.67
llama 7B Q4_0 3.83 GiB 7.24 B Metal 99 tg 128 20.06 ± 0.66
llama 7B Q4_0 3.83 GiB 7.24 B Metal 99 tg 256 20.26 ± 0.17
llama 7B Q4_0 3.83 GiB 7.24 B Metal 99 tg 512 13.96 ± 1.62

On Intel 12400 (compiled with sycl but made num-gpu-layers (ngl) = 0)

model size params backend ngl test t/s
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 0 pp 128 18.60 ± 3.07
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 0 pp 256 20.82 ± 0.14
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 0 pp 512 22.48 ± 0.16
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 0 tg 128 10.78 ± 0.02
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 0 tg 256 10.76 ± 0.02
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 0 tg 512 10.69 ± 0.01

On Arc 770

model size params backend ngl test t/s
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 pp 128 407.14 ± 58.05
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 pp 256 583.57 ± 78.24
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 pp 512 757.99 ± 1.48
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 tg 128 24.74 ± 0.27
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 tg 256 24.65 ± 0.20
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 tg 512 21.46 ± 2.39

I compiled llama.cpp with the commit in this PR. The good news is that prompt processing speed is fairly high. The bad news is that text generation on Arc GPUs is very slow. I will do further analysis and create an issue on the llama.cpp repo.

@ddpasa
Copy link

ddpasa commented Feb 13, 2024

These are the oneAPI libraries we would need to bundle with ollama:
Library Size
libOpenCL.so 0.06M
libmkl_core.so 68M
libmkl_sycl_blas.so 97M
libmkl_intel_ilp64.so 20M
libmkl_tbb_thread.so 31M
libtbb.so 3.7M
libsvml.so 26M
libirng.so 1.1M
libintlc.so 0.39M
libsycl.so 4.2M
libimf.so 4.4M
Total 255.85M

Would this be considered too big?

I also saw this comment in gen_linux.sh regarding the CUDA libraries:

# Carry the CUDA libs as payloads to help reduce dependency burden on users
#
# TODO - in the future we may shift to packaging these separately and conditionally
#        downloading them in the install script.

Would this bundle something that would work on my laptop without needing to install oneAPI? If so, I'm eager to try this out.

@felipeagc
Copy link
Author

@chsasank Here are the results from my A750 on the same model you tested:

model size params backend ngl test t/s
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 pp 128 225.73 ± 40.61
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 pp 256 447.46 ± 2.89
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 pp 512 737.13 ± 27.46
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 tg 128 19.64 ± 0.05
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 tg 256 19.64 ± 0.06
llama 7B Q4_0 3.83 GiB 7.24 B SYCL 99 tg 512 19.50 ± 0.01

(this is with F16 turned on)

@felipeagc
Copy link
Author

Would this bundle something that would work on my laptop without needing to install oneAPI? If so, I'm eager to try this out.

@ddpasa Yes, but I haven't configured bundling of the libraries yet. I'll try doing this today. Out of curiosity, which GPU do you have on your laptop?

@ddpasa
Copy link

ddpasa commented Feb 13, 2024

Would this bundle something that would work on my laptop without needing to install oneAPI? If so, I'm eager to try this out.

@ddpasa Yes, but I haven't configured bundling of the libraries yet. I'll try doing this today. Out of curiosity, which GPU do you have on your laptop?

It's an Iris Plus G7; it works really well with ncnn, and I'm hoping for a similar experience.

@felipeagc
Copy link
Author

@ddpasa I couldn't get the oneAPI libraries to work when bundled with ollama; I think your best bet, unfortunately, is just to install the Base Toolkit.

llama_model_load: error loading model: No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/felipe/.ollama/models/blobs/sha256:7247a2b9058b98b6b83d7ae5fad3a56be827d0df8cf5e6578947c519f539e9f0'
{"timestamp":1707854298,"level":"ERROR","function":"load_model","line":378,"message":"unable to load model","model":"/home/felipe/.ollama/models/blobs/sha256:7247a2b9058b98b6b83d7ae5fad3a56be827d0df8cf5e6578947c519f539e9f0"}
time=2024-02-13T16:58:18.032-03:00 level=WARN source=llm.go:162 msg="Failed to load dynamic library /tmp/ollama204219166/oneapi/libext_server.so  error loading model /home/felipe/.ollama/models/blobs/sha256:7247a2b9058b98b6b83d7ae5fad3a56be827d0df8cf5e6578947c519f539e9f0"

@felipeagc
Copy link
Author

Update: added support for building oneAPI-enabled docker images.
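
To run one of these images against an Intel GPU, the container just needs the /dev/dri render nodes passed through, roughly like this (the image tag is a placeholder):

docker run -d --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 ollama-oneapi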

@chsasank @ddpasa I also tested my A750 with llama.cpp's Vulkan backend and the results are interesting:

  • Vulkan results on Linux:
llama_print_timings:      sample time =      62.57 ms /   400 runs   (    0.16 ms per token,  6393.15 tokens per second)
llama_print_timings: prompt eval time =     574.71 ms /    14 tokens (   41.05 ms per token,    24.36 tokens per second)
llama_print_timings:        eval time =   15652.19 ms /   399 runs   (   39.23 ms per token,    25.49 tokens per second)
  • Vulkan results on Windows:
llama_print_timings:      sample time =      62.56 ms /   400 runs   (    0.16 ms per token,  6393.96 tokens per second)
llama_print_timings: prompt eval time =     548.28 ms /    14 tokens (   39.16 ms per token,    25.53 tokens per second)
llama_print_timings:        eval time =   13772.47 ms /   399 runs   (   34.52 ms per token,    28.97 tokens per second)

Both are faster than the SYCL version, and Windows is slightly faster than Linux.

@chsasank
Copy link

chsasank commented Feb 14, 2024

Vulkan results are interesting! Did you follow the instructions from here? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#vulkan

I will reproduce the results with llama-bench.

By the way, I created an issue about performance at ggerganov/llama.cpp#5480. I think we need a performant baseline that utilizes GPU well.
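
For reference, the Vulkan build in question is roughly (option name as documented by llama.cpp around this time; the model path is a placeholder):

# build llama.cpp with the Vulkan backend and run the main example fully offloaded
cmake -B build -DLLAMA_VULKAN=1
cmake --build build --config Release
./build/bin/main -m ./models/model-q4_0.gguf -ngl 99 -p "Hello"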

@felipeagc
Copy link
Author

Vulkan results are interesting! Did you follow the instructions from here? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#vulkan

@chsasank Yes, and I tried running llama-bench with Vulkan but got really bad results (around 3 tok/s), with the last run not even finishing, which is strange. But running the main example works just fine and it's faster than SYCL.

By the way, I created an issue about performance at ggerganov/llama.cpp#5480. I think we need a performant baseline that utilizes GPU well.

Indeed, my initial guess was that the current best-performing solution would be BigDL-LLM, simply because it's made by Intel. It's a pain to install, but I got it working a couple of days ago and the performance is not all that different from llama.cpp. I did not make any precise measurements though (and I'm too lazy to go through their setup again haha). If you want to give it a try, it might give us more insight into this.

@chsasank
Copy link

chsasank commented Feb 14, 2024

I observed the last run not finishing for other tests as well. But you're right, I'm getting very slow tok/s in llama-bench. Makes you wonder if llama-bench is accurate!

Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32

model size params backend ngl test t/s
llama 7B Q4_0 3.83 GiB 7.24 B Vulkan 99 pp 128 145.44 ± 3.59
llama 7B Q4_0 3.83 GiB 7.24 B Vulkan 99 pp 256 176.31 ± 3.35
llama 7B Q4_0 3.83 GiB 7.24 B Vulkan 99 pp 512 190.55 ± 1.63
llama 7B Q4_0 3.83 GiB 7.24 B Vulkan 99 tg 128 5.14 ± 0.01
llama 7B Q4_0 3.83 GiB 7.24 B Vulkan 99 tg 256 5.14 ± 0.04

I too tried installing BigDL and indeed it's a bit of a pain. Besides, the examples in the repo are neither straightforward nor self-contained. I don't think the assumption that first-party repos have good performance is really accurate right now.

So far I have seen that the Intel PyTorch Extensions (IPEX) are pretty performant. I have done some benchmarks and found that the matmul FLOPS match oneMKL, because PyTorch is essentially a wrapper over it:

OneMKL:

$ ./matrix_mul_mkl single 4096
oneMKL DPC++ GEMM benchmark
---------------------------
Device:                  Intel(R) Arc(TM) A770 Graphics
Core/EU count:           512
Maximum clock frequency: 2400 MHz

Benchmarking (4096 x 4096) x (4096 x 4096) matrix multiplication, single precision
 -> Initializing data...
 -> Warmup...
 -> Timing...

Average performance: 15.4866TF

PyTorch

$ python benchmark.py --device xpu
benchmarking xpu
size, elapsed_time, tflops
256, 0.0060196399688720705, 0.005574159280872616
304, 0.002283787727355957, 0.024603393444561694
362, 0.0041447639465332035, 0.0228905330252539
430, 0.007672405242919922, 0.020725443321276304
512, 0.00035498142242431643, 0.7561957867167869
608, 0.0032821416854858397, 0.13695673955448423
724, 0.0032795190811157225, 0.23143846070923876
861, 0.00038845539093017577, 3.286232581154881
1024, 0.00038137435913085935, 5.630907261028377
1217, 0.007856798171997071, 0.4588345719314405
1448, 0.0006739616394042969, 9.009496132995023
1722, 0.0006894111633300781, 14.813276371491625
2048, 0.011164355278015136, 1.538814267029877
2435, 0.0019340753555297852, 14.929783199729785
2896, 0.0031573295593261717, 15.38529233621309
3444, 0.005527663230895996, 14.780116182070135
4096, 0.008821868896484375, 15.579346631049024
4870, 0.016022467613220216, 14.417417564907332
5792, 0.025812244415283202, 15.055316381008176
6888, 0.042959856986999514, 15.214111125690918
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 8.513927459716797e-05, 98.52806521655559
0.00593164, 8.761882781982422e-05, 135.39647008740135
0.008388608, 0.00010061264038085938, 166.75057862005687
0.01186328, 0.0001112222671508789, 213.32562811198284
0.016777216, 0.00013146400451660156, 255.23664917542257
0.023726564, 0.00016138553619384765, 294.03581708215694
0.033554432, 0.00020439624786376952, 328.32727949452465
0.047453132, 0.00026431083679199217, 359.070650117496
0.067108864, 0.0003504037857055664, 383.0373228695053
0.094906264, 0.0004702329635620117, 403.6563633526908
0.134217728, 0.0006433010101318359, 417.27815093122234
0.189812528, 0.0008820772171020507, 430.37621722870074
0.268435456, 0.0012254953384399415, 438.0848259149137
0.37962506, 0.0017089128494262695, 444.28837916158324

LLM inference is actually pretty straightforward - see llama2.c and vanilla-llama. Maybe it's worth hacking vanilla-llama to work with Intel GPUs, and that can be our baseline. I am also working on a pure oneAPI-based backend for LLM inference, but I paused it a bit because llama.cpp got SYCL support. I guess I may have to get back to it again.

@felipeagc
Copy link
Author

LLM inference is actually pretty straightforward - see llama2.c and vanilla-llama. Maybe it's worth hacking vanilla-llama to work with Intel GPUs, and that can be our baseline. I am also working on a pure oneAPI-based backend for LLM inference, but I paused it a bit because llama.cpp got SYCL support. I guess I may have to get back to it again.

@chsasank Very interesting. I'm actually pretty new to this, so I'll look at llama2.c for sure. You should definitely work on the pure oneAPI version; that would be a great project!

@taep96
Copy link

taep96 commented Feb 14, 2024

I followed the instructions and it's not working for me
[screenshot]

@Leo512bit
Copy link

Leo512bit commented Feb 14, 2024 via email

@taep96
Copy link

taep96 commented Feb 14, 2024

I do have it installed
[screenshot]

@Leo512bit
Copy link

Leo512bit commented Feb 14, 2024 via email

@felipeagc
Copy link
Author

I followed the instructions and it's not working for me
[screenshot]

It's not finding the Level Zero library, which is part of Intel's driver. It should have already been installed, so maybe your Linux distro installs it somewhere else. Can you locate where libze_intel_gpu.so is on your machine?
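
Something along these lines should find it (generic commands; the library name comes from the GPU-detection log):

# ask the dynamic loader cache
ldconfig -p | grep libze_intel_gpu
# or search the usual library directories directly
find /usr/lib /usr/lib64 -name 'libze_intel_gpu.so*' 2>/dev/null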

@taep96
Copy link

taep96 commented Feb 14, 2024

Turns out it's provided by intel-compute-runtime, which is a separate package.

@Leo512bit
Copy link

Leo512bit commented Feb 18, 2024

I tried running Ubuntu on WSL2, but sadly I found out that my A750 does not support virtualization.

Really? I thought Intel Arc supported SR-IOV; did you enable it in the UEFI? I do have an A770 16GB, so maybe only the fat one supports it? (I don't know, I haven't tried passthrough on Arc yet.)

Anyway, I tried compiling on WSL2 but I got this mess. Why was it looking in my VMware install?

@lrussell887
Copy link

lrussell887 commented Mar 2, 2024

EDIT: Please see my comment below; I was able to get past this on 22.04.

I'm testing this on a fresh install of Kubuntu 23.10. My GPU is an Arc A770 16 GB. I installed the Intel oneAPI base toolkit, and have the following go, cmake, and gcc versions.

logan@desktop:~$ go version
go version go1.22.0 linux/amd64
logan@desktop:~$ cmake --version
cmake version 3.27.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).
logan@desktop:~$ gcc --version
gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0

After cloning your repository, I was able to build with go generate ./... and go build . just fine, but then found that no GPU was detected. That was resolved by installing intel-opencl-icd, which provides the Intel compute runtime.
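
For reference, on Ubuntu the user-space GPU stack comes from packages along these lines (intel-opencl-icd is the one mentioned above; the Level Zero package names are the ones Intel's graphics repository uses and may differ by distro or version):

# OpenCL runtime for Intel GPUs (this is what fixed GPU detection here)
sudo apt install intel-opencl-icd
# Level Zero driver and loader, providing libze_intel_gpu.so (names per Intel's apt repo)
sudo apt install intel-level-zero-gpu level-zero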

This is now the output when trying to serve; it segfaults:

logan@desktop:~/ollama$ ./ollama serve
time=2024-03-01T20:51:31.295-05:00 level=INFO source=images.go:863 msg="total blobs: 0"
time=2024-03-01T20:51:31.295-05:00 level=INFO source=images.go:870 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-03-01T20:51:31.296-05:00 level=INFO source=routes.go:999 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-03-01T20:51:31.296-05:00 level=INFO source=payload_common.go:106 msg="Extracting dynamic libraries..."
time=2024-03-01T20:51:31.498-05:00 level=INFO source=payload_common.go:145 msg="Dynamic LLM libraries [oneapi cpu_avx2 cpu cpu_avx]"
time=2024-03-01T20:51:31.499-05:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-01T20:51:31.499-05:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-01T20:51:31.500-05:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-01T20:51:31.500-05:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-01T20:51:31.500-05:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-01T20:51:31.500-05:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-01T20:51:31.502-05:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.24595]"
SIGSEGV: segmentation violation
PC=0x0 m=14 sigcode=1 addr=0x0
signal arrived during cgo execution

goroutine 1 gp=0xc0000061c0 m=14 mp=0xc000600808 [syscall]:
runtime.cgocall(0x9d8aed, 0xc0000d5790)
        /snap/go/10506/src/runtime/cgocall.go:157 +0x4b fp=0xc0000d5768 sp=0xc0000d5730 pc=0x40a74b
github.com/jmorganca/ollama/gpu._Cfunc_oneapi_init(0x7f33a4000ca0, 0xc000436000)
        _cgo_gotypes.go:433 +0x3f fp=0xc0000d5790 sp=0xc0000d5768 pc=0x7d6b1f
github.com/jmorganca/ollama/gpu.LoadOneapiMgmt.func2(0x7f33a4000ca0, 0xc000436000)
        /home/logan/ollama/gpu/gpu.go:375 +0x4a fp=0xc0000d57c0 sp=0xc0000d5790 pc=0x7d9c6a
github.com/jmorganca/ollama/gpu.LoadOneapiMgmt({0xc00042a020, 0x1, 0xc000444020?})
        /home/logan/ollama/gpu/gpu.go:375 +0x205 fp=0xc0000d5880 sp=0xc0000d57c0 pc=0x7d9b25
github.com/jmorganca/ollama/gpu.initGPUHandles()
        /home/logan/ollama/gpu/gpu.go:128 +0x191 fp=0xc0000d5920 sp=0xc0000d5880 pc=0x7d6fb1
github.com/jmorganca/ollama/gpu.GetGPUInfo()
        /home/logan/ollama/gpu/gpu.go:143 +0xc5 fp=0xc0000d5aa8 sp=0xc0000d5920 pc=0x7d7205
github.com/jmorganca/ollama/gpu.CheckVRAM()
        /home/logan/ollama/gpu/gpu.go:265 +0x25 fp=0xc0000d5bb8 sp=0xc0000d5aa8 pc=0x7d87e5
github.com/jmorganca/ollama/server.Serve({0x311f110, 0xc0004457c0})
        /home/logan/ollama/server/routes.go:1021 +0x45a fp=0xc0000d5cc0 sp=0xc0000d5bb8 pc=0x9bb5ba
github.com/jmorganca/ollama/cmd.RunServer(0xc000476400?, {0x35b57e0?, 0x4?, 0xaf983f?})
        /home/logan/ollama/cmd/cmd.go:705 +0x199 fp=0xc0000d5d58 sp=0xc0000d5cc0 pc=0x9cee39
github.com/spf13/cobra.(*Command).execute(0xc000470f08, {0x35b57e0, 0x0, 0x0})
        /home/logan/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x882 fp=0xc0000d5e78 sp=0xc0000d5d58 pc=0x77e0e2
github.com/spf13/cobra.(*Command).ExecuteC(0xc000470308)
        /home/logan/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc0000d5f30 sp=0xc0000d5e78 pc=0x77e925
github.com/spf13/cobra.(*Command).Execute(...)
        /home/logan/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        /home/logan/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
        /home/logan/ollama/main.go:11 +0x4d fp=0xc0000d5f50 sp=0xc0000d5f30 pc=0x9d6cad
runtime.main()
        /snap/go/10506/src/runtime/proc.go:271 +0x29d fp=0xc0000d5fe0 sp=0xc0000d5f50 pc=0x440dbd
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000d5fe8 sp=0xc0000d5fe0 pc=0x473e01

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000068fa8 sp=0xc000068f88 pc=0x4411ee
runtime.goparkunlock(...)
        /snap/go/10506/src/runtime/proc.go:408
runtime.forcegchelper()
        /snap/go/10506/src/runtime/proc.go:326 +0xb3 fp=0xc000068fe0 sp=0xc000068fa8 pc=0x441073
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000068fe8 sp=0xc000068fe0 pc=0x473e01
created by runtime.init.6 in goroutine 1
        /snap/go/10506/src/runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000069780 sp=0xc000069760 pc=0x4411ee
runtime.goparkunlock(...)
        /snap/go/10506/src/runtime/proc.go:408
runtime.bgsweep(0xc00003a070)
        /snap/go/10506/src/runtime/mgcsweep.go:318 +0xdf fp=0xc0000697c8 sp=0xc000069780 pc=0x42c83f
runtime.gcenable.gowrap1()
        /snap/go/10506/src/runtime/mgc.go:203 +0x25 fp=0xc0000697e0 sp=0xc0000697c8 pc=0x421125
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000697e8 sp=0xc0000697e0 pc=0x473e01
created by runtime.gcenable in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x8334575?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000069f78 sp=0xc000069f58 pc=0x4411ee
runtime.goparkunlock(...)
        /snap/go/10506/src/runtime/proc.go:408
runtime.(*scavengerState).park(0x3553ce0)
        /snap/go/10506/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000069fa8 sp=0xc000069f78 pc=0x42a1c9
runtime.bgscavenge(0xc00003a070)
        /snap/go/10506/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000069fc8 sp=0xc000069fa8 pc=0x42a779
runtime.gcenable.gowrap2()
        /snap/go/10506/src/runtime/mgc.go:204 +0x25 fp=0xc000069fe0 sp=0xc000069fc8 pc=0x4210c5
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x473e01
created by runtime.gcenable in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc000068648?, 0x4144e5?, 0xa8?, 0x1?, 0xc0000061c0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000068620 sp=0xc000068600 pc=0x4411ee
runtime.runfinq()
        /snap/go/10506/src/runtime/mfinal.go:194 +0x107 fp=0xc0000687e0 sp=0xc000068620 pc=0x420167
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000687e8 sp=0xc0000687e0 pc=0x473e01
created by runtime.createfing in goroutine 1
        /snap/go/10506/src/runtime/mfinal.go:164 +0x3d

goroutine 6 gp=0xc000386e00 m=nil [select, locked to thread]:
runtime.gopark(0xc00006a7a8?, 0x2?, 0x89?, 0x14?, 0xc00006a794?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc00006a638 sp=0xc00006a618 pc=0x4411ee
runtime.selectgo(0xc00006a7a8, 0xc00006a790, 0x0?, 0x0, 0x0?, 0x1)
        /snap/go/10506/src/runtime/select.go:327 +0x725 fp=0xc00006a758 sp=0xc00006a638 pc=0x452645
runtime.ensureSigM.func1()
        /snap/go/10506/src/runtime/signal_unix.go:1034 +0x19f fp=0xc00006a7e0 sp=0xc00006a758 pc=0x46b25f
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006a7e8 sp=0xc00006a7e0 pc=0x473e01
created by runtime.ensureSigM in goroutine 1
        /snap/go/10506/src/runtime/signal_unix.go:1017 +0xc8

goroutine 18 gp=0xc000102380 m=3 mp=0xc00006f008 [syscall]:
runtime.notetsleepg(0x35b63a0, 0xffffffffffffffff)
        /snap/go/10506/src/runtime/lock_futex.go:246 +0x29 fp=0xc0000647a0 sp=0xc000064778 pc=0x412b09
os/signal.signal_recv()
        /snap/go/10506/src/runtime/sigqueue.go:152 +0x29 fp=0xc0000647c0 sp=0xc0000647a0 pc=0x470869
os/signal.loop()
        /snap/go/10506/src/os/signal/signal_unix.go:23 +0x13 fp=0xc0000647e0 sp=0xc0000647c0 pc=0x70d993
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000647e8 sp=0xc0000647e0 pc=0x473e01
created by os/signal.Notify.func1.1 in goroutine 1
        /snap/go/10506/src/os/signal/signal.go:151 +0x1f

goroutine 19 gp=0xc0001028c0 m=nil [chan receive]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000064f18 sp=0xc000064ef8 pc=0x4411ee
runtime.chanrecv(0xc00019b3e0, 0x0, 0x1)
        /snap/go/10506/src/runtime/chan.go:583 +0x3bf fp=0xc000064f90 sp=0xc000064f18 pc=0x40cd5f
runtime.chanrecv1(0x0?, 0x0?)
        /snap/go/10506/src/runtime/chan.go:442 +0x12 fp=0xc000064fb8 sp=0xc000064f90 pc=0x40c972
github.com/jmorganca/ollama/server.Serve.func2()
        /home/logan/ollama/server/routes.go:1008 +0x25 fp=0xc000064fe0 sp=0xc000064fb8 pc=0x9bb6a5
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000064fe8 sp=0xc000064fe0 pc=0x473e01
created by github.com/jmorganca/ollama/server.Serve in goroutine 1
        /home/logan/ollama/server/routes.go:1007 +0x3f6

goroutine 20 gp=0xc000102a80 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000065750 sp=0xc000065730 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000657e0 sp=0xc000065750 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000657e8 sp=0xc0000657e0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 21 gp=0xc000102c40 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000065f50 sp=0xc000065f30 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc000065fe0 sp=0xc000065f50 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000065fe8 sp=0xc000065fe0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 22 gp=0xc000102e00 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000066750 sp=0xc000066730 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000667e0 sp=0xc000066750 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000667e8 sp=0xc0000667e0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 7 gp=0xc000387340 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacee673?, 0x1?, 0xfe?, 0x32?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc00006af50 sp=0xc00006af30 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc00006afe0 sp=0xc00006af50 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006afe8 sp=0xc00006afe0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 8 gp=0xc000387500 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacd4cc4?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc00006b750 sp=0xc00006b730 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc00006b7e0 sp=0xc00006b750 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006b7e8 sp=0xc00006b7e0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 9 gp=0xc0003876c0 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacd6ea8?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc00006bf50 sp=0xc00006bf30 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc00006bfe0 sp=0xc00006bf50 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 23 gp=0xc000102fc0 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacd4bdd?, 0x0?, 0x0?, 0x0?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000066f50 sp=0xc000066f30 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc000066fe0 sp=0xc000066f50 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000066fe8 sp=0xc000066fe0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 10 gp=0xc000387880 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacd5737?, 0x1?, 0x8e?, 0x67?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc0004a4750 sp=0xc0004a4730 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0004a47e0 sp=0xc0004a4750 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004a47e8 sp=0xc0004a47e0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 34 gp=0xc000504000 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacef7c9?, 0x1?, 0x4a?, 0x8c?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc0004a0750 sp=0xc0004a0730 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0004a07e0 sp=0xc0004a0750 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004a07e8 sp=0xc0004a07e0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 35 gp=0xc0005041c0 m=nil [GC worker (idle)]:
runtime.gopark(0x35b7640?, 0x1?, 0xc5?, 0xdc?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc0004a0f50 sp=0xc0004a0f30 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0004a0fe0 sp=0xc0004a0f50 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004a0fe8 sp=0xc0004a0fe0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 11 gp=0xc000387a40 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeacd5cda?, 0x1?, 0x90?, 0x9f?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc0004a4f50 sp=0xc0004a4f30 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0004a4fe0 sp=0xc0004a4f50 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004a4fe8 sp=0xc0004a4fe0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

goroutine 24 gp=0xc000103180 m=nil [GC worker (idle)]:
runtime.gopark(0x13aeaceebed?, 0x1?, 0x19?, 0x1c?, 0x0?)
        /snap/go/10506/src/runtime/proc.go:402 +0xce fp=0xc000067750 sp=0xc000067730 pc=0x4411ee
runtime.gcBgMarkWorker()
        /snap/go/10506/src/runtime/mgc.go:1310 +0xe5 fp=0xc0000677e0 sp=0xc000067750 pc=0x423205
runtime.goexit({})
        /snap/go/10506/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000677e8 sp=0xc0000677e0 pc=0x473e01
created by runtime.gcBgMarkStartWorkers in goroutine 1
        /snap/go/10506/src/runtime/mgc.go:1234 +0x1c

rax    0x0
rbx    0xc000436048
rcx    0x24a870
rdx    0x1
rdi    0x0
rsi    0x0
rbp    0x7f33d8ff8de0
rsp    0x7f33d8ff8be8
r8     0x7f33aa405260
r9     0x0
r10    0x7f33aa404ac8
r11    0x7f33a4000d20
r12    0x7f33d8ff8d00
r13    0x0
r14    0xc0000061c0
r15    0x4924912cd948
rip    0x0
rflags 0x10246
cs     0x33
fs     0x0
gs     0x0

I confirmed that the oneAPI environment can be loaded manually:

logan@desktop:~$ source /opt/intel/oneapi/setvars.sh
 
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.2.15(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

I also tested this under Ubuntu 22.04 in WSL earlier, which surprisingly enough had the same result: a segfault after trying to serve.

@lrussell887
Copy link

lrussell887 commented Mar 2, 2024

EDIT: I missed something silly -- I didn't run source /opt/intel/oneapi/setvars.sh prior to ./ollama serve. Though I was under the impression that ollama should now initialize the environment on its own? In any case, it's now working, and I suppose the following can be read as a tutorial.

[screenshot]


I came to the conclusion that the segfault was related to drivers, and I've since installed Kubuntu 22.04, since 22.04 is what Intel seems to have validated everything on. Doing so helped me get farther along, but I was still running into issues until the setvars.sh fix noted above.

To detail my setup process:

EDIT: Be sure to initialize your oneAPI environment with source /opt/intel/oneapi/setvars.sh prior to running ollama.

Finally, ./ollama serve runs normally, saying an Intel GPU was detected, and running ./ollama run llama2 in another terminal window should now work (previously this led to the fatal error captured in the log below).
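
In short, the working sequence is just:

source /opt/intel/oneapi/setvars.sh   # initialize the oneAPI environment first
./ollama serve                        # terminal 1: reports the Intel GPU as detected
./ollama run llama2                   # terminal 2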

logan@desktop:~/ollama$ ./ollama serve
time=2024-03-01T22:26:59.579-05:00 level=INFO source=images.go:863 msg="total blobs: 6"
time=2024-03-01T22:26:59.582-05:00 level=INFO source=images.go:870 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-03-01T22:26:59.583-05:00 level=INFO source=routes.go:999 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-03-01T22:26:59.583-05:00 level=INFO source=payload_common.go:106 msg="Extracting dynamic libraries..."
time=2024-03-01T22:26:59.621-05:00 level=INFO source=payload_common.go:145 msg="Dynamic LLM libraries [oneapi cpu cpu_avx cpu_avx2]"
time=2024-03-01T22:26:59.621-05:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-01T22:26:59.621-05:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-01T22:26:59.623-05:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-01T22:26:59.623-05:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-01T22:26:59.623-05:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-01T22:26:59.623-05:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-01T22:26:59.625-05:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27191.42]"
time=2024-03-01T22:26:59.700-05:00 level=INFO source=gpu.go:130 msg="Intel GPU detected"
time=2024-03-01T22:26:59.700-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GIN] 2024/03/01 - 22:27:49 | 200 |     870.771µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/01 - 22:27:49 | 200 |    1.840479ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/01 - 22:27:49 | 200 |     269.711µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-01T22:27:49.696-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-01T22:27:49.696-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-01T22:27:49.697-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama283326281/oneapi/libext_server.so
time=2024-03-01T22:27:49.890-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama283326281/oneapi/libext_server.so"
time=2024-03-01T22:27:49.890-05:00 level=INFO source=dyn_ext_server.go:145 msg="Initializing llama server"
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 12,   max work group size 67108864,   max sub group size 64,  global mem size 33536741376
  Device 2: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 16225243136
  Device 3: AMD Ryzen 5 5600X3D 6-Core Processor           ,    compute capability 3.0,
        max compute_units 12,   max work group size 8192,       max sub group size 64,  global mem size 33536741376
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /home/logan/.ollama/models/blobs/sha256:8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    12.01 MiB
llama_new_context_with_model:            compute buffer size =   171.60 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3
Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.
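
That MKL error is about the OpenCL loader, so a generic way to check whether an Intel OpenCL platform is actually visible is clinfo (assuming the clinfo package is installed):

sudo apt install clinfo
clinfo | grep -A2 'Platform Name'   # an Intel platform should be listed for the Arc GPU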

Just to sanity check, I tested PyTorch per https://intel.github.io/intel-extension-for-pytorch/index.html#installation and the GPU is detected:

(test-venv) logan@desktop:~$ python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
/home/logan/test-venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2.1.0a0+cxx11.abi
2.1.10+xpu
[0]: _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512, gpu_eu_count=512)

I also added my user to the video and render groups to no effect. Any help would be appreciated!

@6543
Copy link

6543 commented Mar 6, 2024

This pull already has conflicts :/ What's holding things back from getting it merged?

@felipeagc
Copy link
Author

Hey everyone, I'm currently in the process of moving, so I don't have access to my PC with an Intel Arc and won't have for a little while. If anyone wants to take over this PR, please feel free.

@sgwhat
Copy link

sgwhat commented Mar 7, 2024

Can this PR generate output correctly on Intel Arc (Ubuntu)? I got some garbled output like:

ollama run example "What is your favourite condiment?"
 !##"##!       "!▅"! $   #"# ##  ▅"#! 

@sgwhat
Copy link

sgwhat commented Mar 7, 2024

Can you please write down build instructions on Ubuntu? I'll help you with some feedback and benchmarks.

Hi @chsasank, can you run ollama normally on Ubuntu with an Intel Arc graphics card?

@6543
Copy link

6543 commented Mar 7, 2024

ollama run dolphin-mixtral:latest "What is your favourite condiment?"

works just fine for me ...

@sgwhat
Copy link

sgwhat commented Mar 7, 2024

ollama run dolphin-mixtral:latest "What is your favourite condiment?"

works just fine for me ...

I see... you built a Docker image? By the way, did you run it on Ubuntu?

@6543
Copy link

6543 commented Mar 7, 2024

No, I built it on Arch Linux as a normal binary.

@shanoaice
Copy link

Just one additional thought: does SYCL only work on Intel GPUs, or does it also work on AMD GPUs? ROCm sometimes has odd quirks on Windows (such as GFX version mismatches) that prevent it from working properly, and it would be good to see if OpenCL can be used with AMD GPUs, which has much better support.

Though this would require SYCL support to run on native Windows, since neither ROCm nor the Mesa OpenCL stack seems to support calling the AMD GPU inside WSL2, whether or not virtualization is enabled.

@Oscilloscope98
Copy link

I too tried installing BigDL and indeed it's a bit of a pain.

Hi @chsasank and @felipeagc,

Thank you for sharing your concerns about installing BigDL-LLM :) I'm on the development team and we’d love to help out. Could you share more about the installation problems you're facing?

We have also updated our installation guide for Intel GPUs recently, and added a detailed Quickstart guide (covering installation, benchmarking, etc.) that might help. Please feel free to review them at your convenience and share any thoughts or feedback you might have. Thank you!

@zhewang1-intc
Copy link
Contributor

zhewang1-intc commented Mar 19, 2024

Hi @felipeagc, thank you for making it possible for this outstanding ollama project to run on Intel GPUs.

Let’s work together to push this PR forward. I have rebased onto the latest ollama main branch and verified that it works well on Ubuntu 22.04 + an Arc 770 GPU. I also created pr2; once it gets merged (or a new PR based on pr2 is opened against ollama), we can discuss further with ollama's maintainers how to proceed with merging PR #2458 into the ollama main branch.

I’m new to this project and unfamiliar with many aspects, so I appreciate any guidance from the community. Thank you!

@jiriks74
Copy link

jiriks74 commented Mar 20, 2024

Hello,
I'd just like to say thanks for the work on this.
I'm using a laptop with an Arc A370M dGPU. I've tried building the Docker image and running it, but my dGPU isn't detected.

Logs
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
time=2024-03-20T00:24:25.448Z level=INFO source=routes.go:999 msg="Listening on [::]:11434 (version 0.0.0)"
[GIN-debug] POST   /api/chat                 --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-03-20T00:24:25.448Z level=INFO source=payload_common.go:106 msg="Extracting dynamic libraries..."
time=2024-03-20T00:24:27.879Z level=INFO source=payload_common.go:145 msg="Dynamic LLM libraries [rocm_v5 cpu_avx cpu cuda_v11 oneapi cpu_avx2 rocm_v6]"
time=2024-03-20T00:24:27.879Z level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-20T00:24:27.879Z level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-20T00:24:27.881Z level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-20T00:24:27.881Z level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-20T00:24:27.881Z level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-20T00:24:27.881Z level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-20T00:24:27.881Z level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-20T00:24:27.881Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

AFAIK this dGPU supports oneAPI/SYCL and should work.

I'm happy to test this project for you on my dGPU but I am not familiar enough with this project, GPU programming, etc., to take this over.

EDIT: The dGPUs are in the container:
[screenshot]

EDIT 1: From using llama.cpp directly, I see that it's an old issue I reported (and forgot about because the developer didn't respond): ggerganov/llama.cpp#6808
