MTL GPU driver not shown and GPU demo crashed on Linux #11460

Open
lucshi opened this issue Jun 28, 2024 · 1 comment
lucshi commented Jun 28, 2024

HW: MTL with Arc iGPU
OS: Ubuntu 22.04
Kernel: 6.5.0-41-generic
Ref: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md
Problem 1: sycl-ls does not show the GPU driver/device.
Problem 2: demo.py crashes with a segmentation fault. The log is attached below.
```
intel-fw-gpu is already the newest version (2024.17.5-329~22.04).
intel-i915-dkms is already the newest version (1.24.2.17.240301.20+i29-1).

(llm) sdp@9049fa09fdbc:~$ source /opt/intel/oneapi/setvars.sh --force

:: initializing oneAPI environment ...
-bash: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

(llm) sdp@9049fa09fdbc:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 1003H OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
```
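
The sycl-ls output above lists only the FPGA emulation and CPU backends; a working setup would also show a GPU entry (a level_zero and/or opencl:gpu device). As a cross-check from inside the conda env, here is a minimal sketch, assuming the torch.xpu interface that this IPEX build exposes; on a machine where the GPU driver is missing it will report zero devices, or may fail the same way demo.py does:

```python
# check_xpu.py -- rough sketch: does IPEX see any XPU (Intel GPU) device?
# Assumes intel_extension_for_pytorch is installed in the active env (it is,
# per the log above); the torch.xpu.* calls mirror the torch.cuda API.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers torch.xpu

print("torch:", torch.__version__, "| ipex:", ipex.__version__)
print("xpu available:", torch.xpu.is_available())
print("xpu device count:", torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(f"  [{i}] {torch.xpu.get_device_name(i)}")
```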

Demo crash:

```
(llm) sdp@9049fa09fdbc:~$ python demo.py
/home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-06-27 22:51:53,784 - INFO - intel_extension_for_pytorch auto imported
/home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
2024-06-27 22:51:54,304 - WARNING -
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.95s/it]

2024-06-27 22:52:04,476 - INFO - Converting the current model to sym_int4 format......
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 1003H]
Registry and code: 13 MB
Command: python demo.py
Uptime: 17.979020 s
Segmentation fault (core dumped)
```

The same crash reproduced under gdb:

```
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.25s/it]

2024-06-27 22:54:34,472 - INFO - Converting the current model to sym_int4 format......
[Detaching after vfork from child process 23148]
[New Thread 0x7fffb6fee640 (LWP 23152)]
[New Thread 0x7fffd57fb640 (LWP 23153)]
[New Thread 0x7fffd2ffa640 (LWP 23154)]
[New Thread 0x7fffd07f9640 (LWP 23155)]
[New Thread 0x7fffcdff8640 (LWP 23156)]
[New Thread 0x7fffcb7f7640 (LWP 23157)]
[New Thread 0x7fffc8ff6640 (LWP 23158)]
[New Thread 0x7fffc67f5640 (LWP 23159)]
[New Thread 0x7fffc3ff4640 (LWP 23160)]
[New Thread 0x7fffc17f3640 (LWP 23161)]
[New Thread 0x7fffbeff2640 (LWP 23162)]
[New Thread 0x7fffbe7f1640 (LWP 23163)]
[New Thread 0x7fffb9ff0640 (LWP 23164)]
[New Thread 0x7fffb77ef640 (LWP 23165)]
[New Thread 0x7ffecdf53640 (LWP 23166)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff005f16ab in xpu::dpcpp::initGlobalDevicePoolState() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
(gdb) bt
#0  0x00007fff005f16ab in xpu::dpcpp::initGlobalDevicePoolState() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#1  0x00007ffff7c99ee8 in __pthread_once_slow (once_control=0x7fff13cbddd8 <xpu::dpcpp::init_device_flag>, init_routine=0x7fffe0cdad50 <__once_proxy>) at ./nptl/pthread_once.c:116
#2  0x00007fff005ee491 in xpu::dpcpp::dpcppGetDeviceCount(int*) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#3  0x00007fff005a8c52 in xpu::dpcpp::device_count()::{lambda()#1}::operator()() const ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#4  0x00007fff005a8c18 in xpu::dpcpp::device_count() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#5  0x00007fffa23be0c8 in xpu::THPModule_initExtension(_object*, _object*) ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-python.so
#6  0x000055555573950e in cfunction_vectorcall_NOARGS (func=0x7fffa2410c20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.9/Include/cpython/methodobject.h:52
#7  0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fffa2410c20,
    tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#8  PyObject_Vectorcall (callable=0x7fffa2410c20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:299
#9  0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#10 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb07d0, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#11 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x0, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#12 _PyFunction_Vectorcall (func=<optimized out>, stack=0x0, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#13 0x0000555555730244 in _PyObject_VectorcallTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffee5567380, args=<optimized out>, nargsf=<optimized out>,
    kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#14 0x00005555557fef1c in PyObject_CallMethod (obj=<optimized out>, name=<optimized out>, format=0x7fffa23d7aea "") at /usr/local/src/conda/python-3.11.9/Objects/call.c:627
#15 0x00007fffa23bb48d in xpu::lazy_init() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-python.so
#16 0x00007fff005a8d86 in xpu::dpcpp::current_device() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#17 0x00007fff005ad5b6 in xpu::dpcpp::impl::DPCPPGuardImpl::getDevice() const ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#18 0x00007fffe29b274f in at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007fffe37c3743 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_layout_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fffe3049eea in at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007fffef1dfa19 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool, c10::optional<c10::MemoryFormat>) ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#22 0x00007fffef24a8ec in torch::autograd::THPVariable_to(_object*, _object*, _object*) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x000055555575f1c8 in method_vectorcall_VARARGS_KEYWORDS (func=0x7ffff7104360, args=0x7ffff7fb07a8, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.9/Objects/descrobject.c:364
#24 0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff7104360,
    tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#25 PyObject_Vectorcall (callable=0x7ffff7104360, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:299
#26 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#27 0x0000555555783fc2 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb0140, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#28 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7fffffffc7a0, locals=0x0, func=0x7fffa85d6c00, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#29 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7fffffffc7a0, func=0x7fffa85d6c00) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#30 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7fffffffc7a0, callable=0x7fffa85d6c00, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#31 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/classobject.c:89
--Type <RET> for more, q to quit, c to continue without paging--
```

After reinstalling level-zero, the crash changed to "Killed".
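
For reference, the backtrace dies inside xpu::dpcpp::initGlobalDevicePoolState(), i.e. while IPEX enumerates GPU devices, which is consistent with Problem 1 (no GPU visible to sycl-ls). A rough, standard-library-only sketch for checking the layers underneath the oneAPI runtime (the library path below is the usual Ubuntu location and is an assumption; adjust for your distro):

```python
# gpu_stack_check.py -- rough diagnostic sketch, no third-party packages.
# Checks the pieces GPU device enumeration depends on, below the oneAPI runtime.
import glob

# 1. Kernel driver bound to the iGPU -> a DRM render node should exist.
print("render nodes:", glob.glob("/dev/dri/renderD*") or "none (kernel driver not bound?)")

# 2. User-space compute runtime: OpenCL ICD files and the Level Zero GPU driver.
print("OpenCL ICDs:", glob.glob("/etc/OpenCL/vendors/*.icd") or "none")
print("Level Zero GPU lib:",
      glob.glob("/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so*")  # assumed Ubuntu path
      or "none (intel-level-zero-gpu not installed?)")
```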

qiuxin2012 self-assigned this Jun 28, 2024
lucshi commented Jul 2, 2024

The root cause has been identified by Qiu, Xin: the GPU driver is not properly installed. But MTL is too new, and there is currently no good way to install the driver.
After switching to Ubuntu 24.04 with kernel 6.8, the kernel driver appears to be installed, but sycl-ls still does not show any oneapi GPU entry.
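
When the kernel driver is loaded but sycl-ls still lists no GPU, one common cause is permissions: the user has to be able to open /dev/dri/renderD* (normally via the render and video groups on Ubuntu), otherwise the compute runtime silently skips the device. A quick sketch to check that, with the usual Ubuntu group names assumed:

```python
# perm_check.py -- can the current user open the GPU render node?
import glob, grp, os

# Resolve the current process's group IDs to names (fall back to the raw gid
# if it has no /etc/group entry, which can happen inside containers).
group_names = []
for gid in os.getgroups():
    try:
        group_names.append(grp.getgrgid(gid).gr_name)
    except KeyError:
        group_names.append(str(gid))

print("groups:", sorted(group_names))
print("in render group:", "render" in group_names)
for node in glob.glob("/dev/dri/renderD*"):
    print(node, "readable+writable:", os.access(node, os.R_OK | os.W_OK))
```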
