I followed the instructions from https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html on a bare metal server from the Intel Dev Cloud, specifically this instance:

Intel® Max Series GPU (PVC) on 4th Gen Intel® Xeon® processors – 1100 series (8x)
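For context, the device list that shows up in the log below can also be reproduced with `sycl-ls` from the oneAPI base toolkit (a minimal sanity-check sketch, assuming the oneAPI environment script is at its default location):

```bash
# Source the oneAPI environment and list every SYCL backend/device the runtime can see;
# on this instance it should show 8 level_zero:gpu entries plus the opencl devices.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```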
I got these logs:
Test Command and Logs
```
$ ZE_ENABLE_LOADER_DEBUG_TRACE=1 SYCL_CACHE_PERSISTENT=1 ./main --n-gpu-layers 999 --n-predict 1024 --model ~/share/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --ctx-size 4096 --ignore-eos --split-mode none --main-gpu 0 -f ~/opt/src prompt.txt
Log start
main: build = 1 (baa5868)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed  = 1716242491
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/sdp/share/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
ZE_LOADER_DEBUG_TRACE: Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
found 18 SYCL devices:
|  |                   |                               |       |Max    |        |Max  |Global  |                     |
|  |                   |                               |       |compute|Max work|sub  |mem     |                     |
|ID|        Device Type|                           Name|Version|units  |group   |group|size    |       Driver version|
|--|-------------------|-------------------------------|-------|-------|--------|-----|--------|---------------------|
| 0| [level_zero:gpu:0]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 1| [level_zero:gpu:1]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 2| [level_zero:gpu:2]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 3| [level_zero:gpu:3]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 4| [level_zero:gpu:4]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 5| [level_zero:gpu:5]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 6| [level_zero:gpu:6]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 7| [level_zero:gpu:7]| Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32|  51539M|            1.3.27191|
| 8|     [opencl:gpu:0]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
| 9|     [opencl:gpu:1]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|10|     [opencl:gpu:2]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|11|     [opencl:gpu:3]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|12|     [opencl:gpu:4]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|13|     [opencl:gpu:5]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|14|     [opencl:gpu:6]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|15|     [opencl:gpu:7]| Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32|  48946M|       23.35.27191.42|
|16|     [opencl:cpu:0]|      Intel Xeon Platinum 8468V|    3.0|    192|    8192|   64|1081858M|2024.17.3.0.08_160000|
|17|     [opencl:acc:0]|    Intel FPGA Emulation Device|    1.2|    192|67108864|   64|1081858M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:448
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size   =   70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size  =   0.12 MiB
llama_new_context_with_model: SYCL0 compute buffer size     = 296.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size =  16.01 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 2
Sub-group size 8 is not supported on the device
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp, line:15352, func:operator()
SYCL error: CHECK_TRY_ERROR(op(src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i, dev[i].row_low, dev[i].row_high, src1_ncols, src1_padded_col_size, stream)): Meet error in this line code!
  in function ggml_sycl_op_mul_mat at /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:15352
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:3021: !"SYCL error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
```
I suspect the problem is related to the fact that I'm using a machine with 8 GPUs (given the log statement "Sub-group size 8 is not supported on the device").
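For anyone digging into that message, the sub-group sizes a device actually supports can be inspected with `clinfo` (a rough diagnostic sketch, assuming clinfo is installed and the Intel driver reports sub-group sizes; PVC is generally expected to report 16 and 32):

```bash
# Print each OpenCL device name together with the sub-group sizes it reports
# (Intel drivers expose this via the cl_intel_required_subgroup_size extension;
# the exact label may vary by clinfo/driver version).
clinfo | grep -iE "Device Name|sub-group sizes"
```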
I was able to successfully compile and run the llama.cpp source code myself with no issues, so I believe the problem is related to how exactly the IPEX-LLM version of llama.cpp was compiled.
(P.S. When I compiled it myself, I used the 2024.0 version of the oneAPI compiler package)
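For comparison, my upstream SYCL build followed the standard llama.cpp instructions, roughly as below (a sketch only; the CMake option was LLAMA_SYCL in the tree I used and has since been renamed GGML_SYCL, so adjust for your version):

```bash
# Build upstream llama.cpp with the SYCL backend using the oneAPI 2024.0 compilers.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
```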
Any help would be greatly appreciated!
Hi @player1537 ,
I have fixed this issue; please try again with ipex-llm[cpp] >= 2.1.0b20240521 (which will be released tonight).
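Assuming a pip-based install as in the quickstart, the upgrade would look roughly like this (init-llama-cpp re-creates the symlinked llama.cpp binaries in your working directory):

```bash
# Pull a nightly wheel containing the fix, then refresh the llama.cpp symlinks.
pip install --pre --upgrade "ipex-llm[cpp]>=2.1.0b20240521"
init-llama-cpp
```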
By the way, if you have no special accuracy requirements, we recommend using Q4_0, which gives the fastest speed on PVC : )
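If you only have the Q4_K_M file, a Q4_0 GGUF can be produced from an f16 GGUF with llama.cpp's quantize tool; a rough sketch with illustrative filenames (the binary is named llama-quantize in newer builds):

```bash
# Re-quantize an f16 GGUF to Q4_0; the input/output names here are illustrative.
./quantize mistral-7b-instruct-v0.1.f16.gguf mistral-7b-instruct-v0.1.Q4_0.gguf Q4_0
```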