Enabling ollama to run on Intel GPUs with SYCL backend #3278
Conversation
force-pushed from edfbd44 to 3e1cc67
@jmorganca @mxyng Could you give a review?
force-pushed from fc7bffd to 3f06c4a
I tried this on my Intel integrated GPU. I am able to build and run llama.cpp with Intel GPU support without too many problems by following this tutorial: https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md

docker run -p 8080:8080 -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl-server -m "/app/models/orca-2-13b.Q5_K_M.gguf" -c 512 --host 0.0.0.0 --port 8080 -ngl 41
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Iris(R) Xe Graphics| 1.3| 96| 512| 32| 26669551616|
| 1| [opencl:gpu:0]| Intel(R) Iris(R) Xe Graphics| 3.0| 96| 512| 32| 26669551616|
| 2| [opencl:cpu:0]|11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz| 3.0| 8| 8192| 64| 33336942592|
| 3| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 8|67108864| 64| 33336942592|
{"build":0,"commit":"unknown","function":"main","level":"INFO","line":2756,"msg":"build info","tid":"138119685482496","timestamp":1711301931}
{"function":"main","level":"INFO","line":2763,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"138119685482496","timestamp":1711301931,"total_threads":8}
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from /app/models/orca-2-13b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32003] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32003] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32003] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q5_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_vocab: special tokens definition check successful ( 262/32003 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32003
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 8.60 GiB (5.67 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:96
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: SYCL0 buffer size = 8694.22 MiB
llm_load_tensors: CPU buffer size = 107.43 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 62.51 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 81.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 11.00 MiB
llama_new_context_with_model: graph nodes = 1324
llama_new_context_with_model: graph splits = 2

But the version of llama.cpp started by Ollama fails to detect/use my GPU:

./ollama serve
time=2024-03-24T11:21:16.096-06:00 level=INFO source=images.go:863 msg="total blobs: 6"
time=2024-03-24T11:21:16.096-06:00 level=INFO source=images.go:870 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST /api/generate --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST /api/chat --> github.com/jmorganca/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST /api/create --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST /api/push --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST /api/copy --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST /api/show --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/jmorganca/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET / --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET /api/version --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] HEAD / --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
time=2024-03-24T11:21:16.097-06:00 level=INFO source=routes.go:999 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-03-24T11:21:16.097-06:00 level=INFO source=payload_common.go:106 msg="Extracting dynamic libraries..."
time=2024-03-24T11:21:16.178-06:00 level=INFO source=payload_common.go:145 msg="Dynamic LLM libraries [oneapi cpu_avx2 cpu_avx cpu]"
time=2024-03-24T11:21:16.178-06:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-24T11:21:16.178-06:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-24T11:21:16.182-06:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-24T11:21:16.182-06:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-24T11:21:16.183-06:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-24T11:21:16.183-06:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-24T11:21:16.187-06:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27191.42]"
time=2024-03-24T11:21:16.209-06:00 level=INFO source=gpu.go:130 msg="Intel GPU detected"
time=2024-03-24T11:21:16.209-06:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-24T11:21:16.210-06:00 level=INFO source=routes.go:1022 msg="no GPU detected"

Probably due to the unified memory of the integrated GPU. I am not experienced with Go or with this project, but a look into the routes.go file gives me the impression that the VRAM check fails because of the unified memory:

1019 if runtime.GOOS == "linux" { // TODO - windows too
1020     // check compatibility to log warnings
1021     if _, err := gpu.CheckVRAM(); err != nil {
1022         slog.Info(err.Error())
1023     }
1024 }

Any suggestions what I could do to work around this?

Update: Poking around a little bit shows that I seem to be on the right path with the unified memory. Here it states:
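If unified memory is indeed the culprit, one possible direction is to have the VRAM check fall back to a share of system RAM for integrated GPUs instead of erroring out. A minimal sketch — `gpuMem`, `checkVRAM`, and the half-of-RAM budget are all invented for illustration, not ollama's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// gpuMem is a hypothetical view of what GPU discovery might report.
type gpuMem struct {
	dedicatedVRAM uint64 // bytes of dedicated VRAM; often 0 on iGPUs
	systemRAM     uint64 // total system RAM in bytes
	integrated    bool   // true for iGPUs sharing memory with the host
}

// checkVRAM returns the memory budget the runner may use. For a
// unified-memory integrated GPU it falls back to a conservative share
// of system RAM instead of failing with "no GPU detected".
func checkVRAM(g gpuMem) (uint64, error) {
	if g.dedicatedVRAM > 0 {
		return g.dedicatedVRAM, nil
	}
	if g.integrated && g.systemRAM > 0 {
		return g.systemRAM / 2, nil // half of unified memory, arbitrary cap
	}
	return 0, errors.New("no usable GPU memory detected")
}

func main() {
	igpu := gpuMem{dedicatedVRAM: 0, systemRAM: 32 << 30, integrated: true}
	budget, err := checkVRAM(igpu)
	fmt.Println(budget, err) // 17179869184 <nil> (i.e. 16 GiB)
}
```

A real implementation would fill these fields from the Level Zero/Sysman queries shown in the logs above rather than from hard-coded values.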
@semidark Yes, I think there is some confusion in the logic here; we will refactor later to align with the llama.cpp side.
@zhewang1-intc I just got my hands on an Intel ARC GPU. My plan is to get it installed in a system and try this PR out this week, then I'll do a code review for you.

Update: It looks like the ARC GPUs don't work on the older motherboard/CPU combos I have at the moment, so I'll need to source a newer setup to validate it. Hopefully next week I can get it up and running.
force-pushed from 9dd2b62 to 384bc56
I haven't looked through the code in depth yet, but when I tried to run this on my test system, something doesn't work correctly during discovery.
time=2024-04-01T22:05:13.407Z level=DEBUG source=gpu.go:332 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /home/daniel/libze_intel_gpu.so*]"
time=2024-04-01T22:05:13.408Z level=INFO source=gpu.go:360 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.39]"
wiring Level-Zero management library functions in /usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.39
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
zesInit err: 2013265921
time=2024-04-01T22:05:13.430Z level=INFO source=gpu.go:406 msg="Unable to load oneAPI management library /usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.39: oneapi vram init failure: 2013265921"
I was trying to emulate a user environment (not a developer one), so I only installed the driver packages for Ubuntu.
Dockerfile (outdated):
COPY --from=llm-code / /go/src/github.com/jmorganca/ollama/
WORKDIR /go/src/github.com/jmorganca/ollama/llm/generate
We recently cleaned up the package paths, so this is stale. Replace jmorganca with ollama.
It seems that the err code
Hi, it should work now after ggerganov/llama.cpp#6435 is merged.
force-pushed from 6b1fb68 to aaf0a67
force-pushed from e8aa4a5 to ac30dc4
force-pushed from f66d6f5 to ca1ac65
force-pushed from 64af97c to b4c7cce
@dhiltgen Hi, could you please take a review when you are free, so we can improve this PR?
Since I'm having some difficulty getting this running successfully on my test system, let me suggest we break this into 2 pieces to reduce the rebase churn so we can make incremental progress. Let's focus this PR on the base enablement with the
CC=icx
CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=OFF"
BUILD_DIR="../build/linux/${ARCH}/oneapi"
EXTRA_LIBS="-fsycl -Wl,-rpath,${ONEAPI_ROOT}/compiler/latest/lib,-rpath,${ONEAPI_ROOT}/mkl/latest/lib,-rpath,${ONEAPI_ROOT}/tbb/latest/lib,-rpath,${ONEAPI_ROOT}/compiler/latest/opt/oclfpga/linux64/lib -lOpenCL -lmkl_core -lmkl_sycl_blas -lmkl_intel_ilp64 -lmkl_tbb_thread -ltbb"
This feels like it may be problematic if the user doesn't have oneAPI installed at the exact same location as the build system. I think we'll eventually need to carry these libraries as dependency payloads like we do with CUDA and ROCm (assuming that's permitted).
As an incremental step before we expose this in the official builds, this may be OK, although I see you are copying the libraries below. Ideally we'd set this up so the user only needs the driver installed on the host, and we use the user-space library from our build to ensure things are linked properly.
The libraries copied to the ${BUILD_DIR}/bin directory contain the essential oneAPI dependencies required for running llama.cpp. There's no need for users to install oneAPI themselves; the build system handles this dependency (which means the build system itself must have oneAPI installed).
When the ollama binary executes, these dependencies are extracted to a temporary location on the user's machine. Upon detecting an Intel GPU driver, we append the path to this temporary directory (containing the oneAPI dependencies) to the LD_LIBRARY_PATH environment variable. This ensures that even if oneAPI isn't installed locally, the program can still locate the necessary libraries to function.
In that case, I don't think this rpath is what we want: it sets where the loader finds the libraries at runtime and assumes you have installed the libs in the same location as the build system. We likely want a relative path based on $ORIGIN, like we're doing with ROCm. We can tidy this up in a follow-up.
llm/server.go (outdated):
if strings.HasPrefix(servers[i], "oneapi") {
	os.Setenv("Path", os.Getenv("Path")+dir)
	slog.Debug("append oneapi lib in Path env:", os.Getenv("Path"))
}
This doesn't seem necessary. This block should cover it, and if you need additional deps wired up, use GpuInfo.DependencyPath.
Yes, you are right. I removed it, and ollama still works even if we don't source the oneAPI environment script.
force-pushed from 812e8f2 to fd5971b
@dhiltgen Is anything needed before this can be merged?
I'll merge this once we finalize the 0.1.39 release.
Hi, I am submitting this PR to enable ollama to run on Intel GPUs with SYCL as the backend. This PR was originally started by @felipeagc, who is currently unable to participate actively due to relocation.
The original PR had fallen behind the main branch, making it inconvenient for maintainers @mxyng @jmorganca @dhiltgen to review. Therefore, I rebased onto the latest main branch and opened this new pull request. I have verified that it works correctly on Ubuntu 22.04 with an ARC 770 GPU.
I am not very familiar with this project yet, so I welcome any guidance and assistance from the community. Let's work together to make ollama support Intel GPU platforms. cc: @hshen14 @kevinintel @airMeng
UPDATE: works well on Windows 10 + ARC 770
UPDATE: works well on the oneAPI Docker image (oneapi-basekit, Ubuntu 22.04) + ARC 770