
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans #10845

Closed
raj-ritu17 opened this issue Apr 22, 2024 · 10 comments

@raj-ritu17

Platform: Flex 170
ipex-llm[cpp] is already updated.
Scenario:

  1. Set the parameters:
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
export PARAMETER use_mmap false
source /opt/intel/oneapi/setvars.sh
  2. Started the server:
    OLLAMA_HOST=0.0.0.0 ./ollama serve

  3. Started a very simple script to use Ollama with Chainlit (and LangChain); script content:

from langchain_community.llms import Ollama
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import Runnable
from langchain.schema.runnable.config import RunnableConfig

import chainlit as cl

#load_dotenv()

@cl.on_chat_start
async def on_chat_start():

    # Sending an image with the local file path
    elements = [
        cl.Image(name="image1", display="inline", path="gemma.jpeg")
    ]
    await cl.Message(content="Hello there, I am Gemma. How can I help you ?", elements=elements).send()
    #model = Ollama(model="gemma:2b")
    model = Ollama(model="llama2:new")
    #model = Ollama(model="test:latest")
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You're a very knowledgeable historian who provides accurate and eloquent answers to historical questions.",
            ),
            ("human", "{question}"),
        ]
    )
    runnable = prompt | model | StrOutputParser()
    cl.user_session.set("runnable", runnable)


@cl.on_message
async def on_message(message: cl.Message):
    runnable = cl.user_session.get("runnable")  # type: Runnable

    msg = cl.Message(content="")

    async for chunk in runnable.astream(
        {"question": message.content},
        config=RunnableConfig(callbacks=[cl.LangchainCallbackHandler()]),
    ):
        await msg.stream_token(chunk)

    await msg.send()

==================================
error from server side:

found 3 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|            Intel(R) Data Center GPU Flex 170|       1.3|        512|    1024|     32|    14193102848|
| 1|    [opencl:cpu:0]|               INTEL(R) XEON(R) PLATINUM 8580|       3.0|        240|    8192|     64|    67113893888|
| 2|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|        240|67108864|     64|    67113893888|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:  SYCL_Host buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   328.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:15845, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15845
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
time=2024-04-22T14:35:09.642+02:00 level=ERROR source=server.go:285 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: -1 error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15845\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
time=2024-04-22T14:35:09.642+02:00 level=ERROR source=server.go:293 msg="unable to load any llama server" error="llama runner process no longer running: -1 error:CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!\n  in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:15845\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
[GIN] 2024/04/22 - 14:35:09 | 500 | 14.996665381s |       127.0.0.1 | POST     "/api/generate"

client side:

2024-04-22 14:34:36 - Your app is available at http://localhost:8000
gio: http://localhost:8000: 
2024-04-22 14:34:49 - Translated markdown file for en-US not found. Defaulting to chainlit.md.
2024-04-22 14:35:09 - Ollama call failed with status code 500. Details: <bound method ClientResponse.text of <ClientResponse(http://localhost:11434/api/generate) [500 Internal Server Error]>
<CIMultiDictProxy('Content-Type': 'application/json; charset=utf-8', 'Date': 'Mon, 22 Apr 2024 12:35:09 GMT', 'Content-Length': '764')>
>
...
...
...
ValueError: Ollama call failed with status code 500. Details: <bound method ClientResponse.text of <ClientResponse(http://localhost:11434/api/generate) [500 Internal Server Error]>
<CIMultiDictProxy('Content-Type': 'application/json; charset=utf-8', 'Date': 'Mon, 22 Apr 2024 12:35:09 GMT', 'Content-Length': '764')>
>
@rnwang04
Contributor

Hi @raj-ritu17,
Would you mind sharing the result of pip list in your current conda env to help us better locate this issue?

@raj-ritu17
Author

@rnwang04, here is the list

(llama) intel@IMU-NEX-EMR1-SUT:~/ritu/ollama$ pip list
Package                     Version
--------------------------- ------------------
accelerate                  0.21.0
aiohttp                     3.9.3
aiosignal                   1.3.1
altair                      5.2.0
annotated-types             0.6.0
anyio                       4.3.0
async-timeout               4.0.3
attrs                       23.2.0
backoff                     2.2.1
bigdl-core-cpp              2.5.0b20240421
bigdl-core-xe-21            2.5.0b20240416
bigdl-core-xe-esimd-21      2.5.0b20240416
bigdl-llm                   2.5.0b20240416
blinker                     1.7.0
cachetools                  5.3.3
certifi                     2024.2.2
charset-normalizer          3.3.2
chromadb                    0.3.25
click                       8.1.7
clickhouse-connect          0.7.3
coloredlogs                 15.0.1
dataclasses-json            0.5.14
diskcache                   5.6.3
distro                      1.9.0
dpcpp-cpp-rt                2024.0.2
duckdb                      0.10.1
exceptiongroup              1.2.0
fastapi                     0.110.0
filelock                    3.13.1
flatbuffers                 24.3.7
frozenlist                  1.4.1
fsspec                      2024.3.0
gguf                        0.6.0
gitdb                       4.0.11
GitPython                   3.1.42
greenlet                    3.0.3
h11                         0.14.0
hnswlib                     0.8.0
httpcore                    1.0.4
httptools                   0.6.1
httpx                       0.25.2
huggingface-hub             0.21.4
humanfriendly               10.0
idna                        3.6
influxdb-client             1.41.0
intel-cmplr-lib-rt          2024.0.2
intel-cmplr-lic-rt          2024.0.2
intel-extension-for-pytorch 2.1.10+xpu
intel-opencl-rt             2024.0.2
intel-openmp                2024.0.2
ipex-llm                    2.1.0b20240421
Jinja2                      3.1.3
jsonpatch                   1.33
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema-specifications   2023.12.1
langchain                   0.1.12
langchain-community         0.0.28
langchain-core              0.1.32
langchain-text-splitters    0.0.1
langsmith                   0.1.27
llama_cpp_python            0.2.56
lz4                         4.3.3
markdown-it-py              3.0.0
MarkupSafe                  2.1.5
marshmallow                 3.21.1
mdurl                       0.1.2
mkl                         2024.0.0
mkl-dpcpp                   2024.0.0
monotonic                   1.6
mpmath                      1.3.0
multidict                   6.0.5
mypy-extensions             1.0.0
networkx                    3.2.1
numexpr                     2.9.0
numpy                       1.26.4
ollama                      0.1.7
onednn                      2024.0.0
onemkl-sycl-blas            2024.0.0
onemkl-sycl-datafitting     2024.0.0
onemkl-sycl-dft             2024.0.0
onemkl-sycl-lapack          2024.0.0
onemkl-sycl-rng             2024.0.0
onemkl-sycl-sparse          2024.0.0
onemkl-sycl-stats           2024.0.0
onemkl-sycl-vm              2024.0.0
onnxruntime                 1.17.1
openai                      1.14.1
openapi-schema-pydantic     1.2.4
orjson                      3.9.15
overrides                   7.7.0
packaging                   23.2
pandas                      2.0.3
pillow                      10.2.0
pip                         23.3.1
posthog                     3.5.0
protobuf                    4.25.3
psutil                      5.9.8
py-cpuinfo                  9.0.0
pyarrow                     15.0.1
pydantic                    1.10.14
pydantic_core               2.16.3
pydeck                      0.8.1b0
Pygments                    2.17.2
python-dateutil             2.9.0.post0
python-dotenv               1.0.1
pytz                        2024.1
PyYAML                      6.0.1
reactivex                   4.0.4
referencing                 0.34.0
regex                       2023.12.25
requests                    2.31.0
rich                        13.7.1
rpds-py                     0.18.0
safetensors                 0.4.2
sentencepiece               0.2.0
setuptools                  68.2.2
six                         1.16.0
smmap                       5.0.1
sniffio                     1.3.1
SQLAlchemy                  2.0.28
starlette                   0.36.3
streamlit                   1.32.2
sympy                       1.12
tabulate                    0.9.0
tbb                         2021.11.0
tenacity                    8.2.3
tokenizers                  0.15.2
toml                        0.10.2
toolz                       0.12.1
torch                       2.1.0a0+cxx11.abi
torchvision                 0.16.0a0+cxx11.abi
tornado                     6.4
tqdm                        4.66.2
transformers                4.36.0
typing_extensions           4.10.0
typing-inspect              0.9.0
tzdata                      2024.1
urllib3                     2.2.1
uvicorn                     0.29.0
uvloop                      0.19.0
watchdog                    4.0.0
watchfiles                  0.21.0
websockets                  12.0
wheel                       0.41.2
yarl                        1.9.4
zstandard                   0.22.0

@rnwang04
Contributor

Hi @raj-ritu17,
I guess this issue is caused by:

onednn                      2024.0.0
onemkl-sycl-blas            2024.0.0
onemkl-sycl-datafitting     2024.0.0
onemkl-sycl-dft             2024.0.0
onemkl-sycl-lapack          2024.0.0
onemkl-sycl-rng             2024.0.0
onemkl-sycl-sparse          2024.0.0
onemkl-sycl-stats           2024.0.0
onemkl-sycl-vm              2024.0.0

If you want to use llama.cpp or Ollama on a Linux system, DO NOT USE pip to install oneAPI like this: pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0.
Would you mind creating a new conda env without pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0 and trying Ollama again?
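For reference, a minimal sketch of such a clean setup (the env name llm-clean and the Python version are only examples; the ipex-llm[cpp] install command is assumed to match the quickstart, and oneAPI is expected to come from the system installation rather than pip):

# hedged sketch: fresh conda env with no pip-installed oneAPI packages
conda create -n llm-clean python=3.11 -y
conda activate llm-clean
pip install --pre --upgrade "ipex-llm[cpp]"   # do NOT pip install dpcpp-cpp-rt / mkl-dpcpp / onednn here
source /opt/intel/oneapi/setvars.sh           # pick up the system-level oneAPI instead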

@raj-ritu17
Author

raj-ritu17 commented Apr 22, 2024

Thanks for the update. Don't we need those for oneAPI?

I was following this document:
https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html

@rnwang04
Contributor

don't we need those for one-api ?
I was following this document: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html

Yes, in our Linux installation guide we recommend using APT to install oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-oneapi.

You don't need to install oneAPI again in your conda env by pip installing any oneAPI-related package.

So I suggest creating a new conda env without pip installing any oneAPI-related packages, to see whether this solves the issue.
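For reference, a minimal sketch of the APT route, assuming Intel's oneAPI apt repository is already configured as described in the linked guide (intel-basekit is the Base Toolkit metapackage; the guide may instead pin individual components and versions):

# hedged sketch: install oneAPI system-wide via APT rather than pip
sudo apt update
sudo apt install -y intel-basekit        # or the specific oneAPI components listed in the guide
source /opt/intel/oneapi/setvars.sh      # activate the system oneAPI before starting ollama serve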

@raj-ritu17
Author

@rnwang04, much appreciated.
This is working fine.

@raj-ritu17
Author

@rnwang04 Unfortunately, it doesn't work anymore.
It is not stable. I pulled a new model and tried to run it, but it just hangs.

For example (llama3):

intel@IMU-NEX-EMR1-SUT:~/ritu/ollama$ ./ollama run  gemma
pulling manifest
pulling ef311de6af9d... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 5.0 GB
pulling 097a36493f71... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 8.4 KB
pulling 109037bec39c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  136 B
pulling 65bb16cf5983... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  109 B
pulling 0c2a5137eb3c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
⠙ (spinner kept running until it timed out)
Error: timed out waiting for llama runner to start:

and server side:

found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|            Intel(R) Data Center GPU Flex 170|       1.3|        512|    1024|     32|    14193102848|
| 1|    [opencl:gpu:0]|            Intel(R) Data Center GPU Flex 170|       3.0|        512|    1024|     32|    14193102848|
| 2|    [opencl:cpu:0]|               INTEL(R) XEON(R) PLATINUM 8580|       3.0|        240|    8192|     64|    67113893888|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|        240|67108864|     64|    67113893888|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.19 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4773.90 MiB
llm_load_tensors:        CPU buffer size =   615.23 MiB
[GIN] 2024/04/23 - 13:37:01 | 200 |       54.86µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/04/23 - 13:37:01 | 200 |     3.36308ms |       127.0.0.1 | GET      "/api/tags"



time=2024-04-23T13:46:43.762+02:00 level=ERROR source=server.go:285 msg="error starting llama server" server=cpu_avx2 error="timed out waiting for llama runner to start: "
time=2024-04-23T13:46:43.763+02:00 level=ERROR source=server.go:293 msg="unable to load any llama server" error="timed out waiting for llama runner to start: "
[GIN] 2024/04/23 - 13:46:43 | 500 |         10m1s |       127.0.0.1 | POST     "/api/chat"

I have also tried llama2, but sometimes I get a different issue:

Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17037, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, data, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17037
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
time=2024-04-23T14:09:56.097+02:00 level=ERROR source=server.go:285 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: -1 error:CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, data, size) .wait()): Meet error in this line code!\n  in function ggml_backend_sycl_buffer_set_tensor at /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17037\nGGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !\"SYCL error\""
time=2024-04-23T14:09:56.097+02:00 level=ERROR source=server.go:293 msg="unable to load any llama server" error="llama runner process no longer running: -1 error:CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, data, size) .wait()): Meet error in this line code!\n  in function ggml_backend_sycl_buffer_set_tensor

here is my pip list:

(llm-v2) intel@IMU-NEX-EMR1-SUT:~/ritu/ollama$ pip list
Package                     Version
--------------------------- ------------------
accelerate                  0.21.0
annotated-types             0.6.0
bigdl-core-xe-21            2.5.0b20240421
bigdl-core-xe-esimd-21      2.5.0b20240421
certifi                     2024.2.2
charset-normalizer          3.3.2
filelock                    3.13.4
fsspec                      2024.3.1
huggingface-hub             0.22.2
idna                        3.7
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp                2024.1.0
ipex-llm                    2.1.0b20240421
Jinja2                      3.1.3
MarkupSafe                  2.1.5
mpmath                      1.3.0
networkx                    3.3
numpy                       1.26.4
packaging                   24.0
pillow                      10.3.0
pip                         23.3.1
protobuf                    5.27.0rc1
psutil                      5.9.8
py-cpuinfo                  9.0.0
pydantic                    2.7.0
pydantic_core               2.18.1
PyYAML                      6.0.1
regex                       2024.4.16
requests                    2.31.0
safetensors                 0.4.3
sentencepiece               0.2.0
setuptools                  68.2.2
sympy                       1.12.1rc1
tabulate                    0.9.0
tokenizers                  0.13.3
torch                       2.1.0a0+cxx11.abi
torchvision                 0.16.0a0+cxx11.abi
tqdm                        4.66.2
transformers                4.31.0
typing_extensions           4.11.0
urllib3                     2.2.1
wheel                       0.41.2

@rnwang04 rnwang04 reopened this Apr 23, 2024
@rnwang04 rnwang04 self-assigned this Apr 23, 2024
@rnwang04
Contributor

rnwang04 commented Apr 23, 2024

Hi @raj-ritu17,
I think these two are actually different issues from the oneAPI issue above.

For the first one, the hang of gemma:

If your program hangs after llm_load_tensors: CPU buffer size = xx.xx MiB, setting use_mmap to false usually solves it. You can refer to #10797 (comment) to see how to add PARAMETER use_mmap false for your model.
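A minimal sketch of that approach, assuming the Modelfile method from the referenced comment (the tag gemma-nommap is a hypothetical name):

# hedged sketch: bake use_mmap=false into a new model tag
cat > Modelfile <<'EOF'
FROM gemma
PARAMETER use_mmap false
EOF
./ollama create gemma-nommap -f Modelfile
./ollama run gemma-nommap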

For the second one, -5 (PI_ERROR_OUT_OF_RESOURCES):

-5 (PI_ERROR_OUT_OF_RESOURCES) means you are out of GPU memory.
You can check your GPU memory with watch -t -n 1 "sudo xpu-smi stats -d 0 | grep \"GPU Memory Used\"" or any other monitoring tool.
I guess this is caused by loading several models into your VRAM at the same time.
You can check with ./ollama list, and if several models are present, you can remove some of them with ./ollama rm xxxx
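Put together, a minimal sketch of that check (the model name below is a placeholder):

# hedged sketch: monitor VRAM, list models, and remove ones you no longer need
watch -t -n 1 "sudo xpu-smi stats -d 0 | grep \"GPU Memory Used\""   # run in a second terminal
./ollama list                     # see which models are present
./ollama rm <unused-model>        # placeholder name; remove models you no longer need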

@raj-ritu17
Author

@rnwang04
So, I have set the mmap and GPU parameters.
It is still hanging. Could it be a memory issue?

This was my memory utilization at runtime:
[screenshot: GPU memory utilization during the run]

@rnwang04
Contributor

Hi @raj-ritu17,
According to our offline sync, can we consider this problem solved?
