
ipex-llm[cpp] error: Sub-group size 8 is not supported on the device #11080

Closed
player1537 opened this issue May 20, 2024 · 4 comments


player1537 commented May 20, 2024

I followed the instructions from https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html on a bare metal server from the Intel Dev Cloud, specifically this instance:

Intel® Max Series GPU (PVC) on 4th Gen Intel® Xeon® processors – 1100 series (8x)
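
For reference, my setup roughly followed the quickstart's steps, something like this (from memory; see the linked page for the exact commands):

$ conda create -n llm-cpp python=3.11
$ conda activate llm-cpp
$ pip install --pre --upgrade ipex-llm[cpp]
$ mkdir llama-cpp && cd llama-cpp
$ init-llama-cpp
$ source /opt/intel/oneapi/setvars.sh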

I got these logs:

Test Command and Logs
$ ZE_ENABLE_LOADER_DEBUG_TRACE=1 SYCL_CACHE_PERSISTENT=1 ./main --n-gpu-layers 999 --n-predict 1024 --model ~/share/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --ctx-size 4096 --ignore-eos --split-mode none --main-gpu 0 -f ~/opt/src prompt.txt
Log start
main: build = 1 (baa5868)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed  = 1716242491
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/sdp/share/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
found 18 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 1| [level_zero:gpu:1]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 2| [level_zero:gpu:2]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 3| [level_zero:gpu:3]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 4| [level_zero:gpu:4]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 5| [level_zero:gpu:5]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 6| [level_zero:gpu:6]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 7| [level_zero:gpu:7]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.27191|
| 8|     [opencl:gpu:0]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
| 9|     [opencl:gpu:1]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|10|     [opencl:gpu:2]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|11|     [opencl:gpu:3]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|12|     [opencl:gpu:4]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|13|     [opencl:gpu:5]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|14|     [opencl:gpu:6]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|15|     [opencl:gpu:7]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       23.35.27191.42|
|16|     [opencl:cpu:0]|              Intel Xeon Platinum 8468V|    3.0|    192|    8192|   64|1081858M|2024.17.3.0.08_160000|
|17|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|    192|67108864|   64|1081858M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:448
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4095.05 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 2
Sub-group size 8 is not supported on the device
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp, line:15352, func:operator()
SYCL error: CHECK_TRY_ERROR(op(src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i, dev[i].row_low, dev[i].row_high, src1_ncols, src1_padded_col_size, stream)): Meet error in this line code!
  in function ggml_sycl_op_mul_mat at /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:15352
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:3021: !"SYCL error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.

I suspect the problem is related to the fact that I'm using a machine with 8 GPUs, given the log message "Sub-group size 8 is not supported on the device".
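
If the multi-GPU setup is the culprit, one way to rule it out might be to expose only a single card to the process before running, e.g. (assuming the Level Zero driver honors this variable):

$ ZE_AFFINITY_MASK=0 ./main <same arguments as above>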

I was able to successfully compile and run the llama.cpp source code myself with no issues, so I believe the problem is related to how exactly the IPEX-LLM version of llama.cpp was compiled.

(P.S. When I compiled it myself, I used the 2024.0 version of the oneAPI compiler package)
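
For reference, my own build used roughly the standard SYCL recipe (exact CMake flag names may differ across llama.cpp versions; newer trees use GGML_SYCL instead of LLAMA_SYCL):

$ source /opt/intel/oneapi/setvars.sh
$ cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
$ cmake --build build --config Release -j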

Any help would be greatly appreciated!

rnwang04 self-assigned this May 21, 2024
@rnwang04
Contributor

Hi @player1537, I have reproduced this error.
We will try to fix it, and once it is done, we will update here to let you know :)

@rnwang04
Contributor

Hi @player1537,
I have fixed this issue; maybe you can try it again with ipex-llm[cpp] >= 2.1.0b20240521 (which will be released tonight).
By the way, if you have no special requirements for accuracy, we recommend using Q4_0, which provides the fastest speed on PVC :)
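
If you already have an f16 GGUF, re-quantizing to Q4_0 can be done with llama.cpp's quantize tool (named llama-quantize in newer builds); the file names below are just examples:

$ ./quantize mistral-7b-instruct-v0.1.f16.gguf mistral-7b-instruct-v0.1.Q4_0.gguf Q4_0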

@player1537
Author

Thank you for the quick update!

I just wanted to confirm that I was able to get IPEX-LLM working now on the aforementioned machine.

We're seeing ~2.2x the inference tokens/s with IPEX-LLM's llama.cpp compared to what we got with my personally compiled llama.cpp! :)


soulyet commented Aug 26, 2024

But we hit the same issue with IPEX-LLM v2.1.0b20240819. Is there something wrong with my environment?

GPU: Intel Arc GPU

logs:
llama.cpp_log.txt
