NEO driver not detect GPU when using kernel 6.8.x. #710

ionutnechita-intel · 2024-02-27T10:50:23Z

The NEO driver does not detect the GPU when using kernel 6.8.x.

With kernels 6.5.x and 6.6.x, the GPU is present:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [24.05.28454.6]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28454]

On kernel 6.8.x, I get this:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]

eero-t · 2024-02-28T19:09:49Z

I can reproduce this with the latest drm-tip 6.8.0-rc6 kernel, using an earlier build (2024-02-09) of the compute-runtime master branch, or earlier compute-runtime releases => neither clinfo nor zello_sysman recognizes the GPU. The vainfo / vpl-inspect media tools still recognize the GPU though, so it's a compute-stack-specific issue.

I do not see any difference in strace output (between old and new kernels) before compute-runtime decides to give up, so it's a bit of a mystery why it decides not to recognize the GPU.

ionutnechita-intel · 2024-02-29T08:21:44Z

Thank you for reproducing this.

On 6.7.x, the GPU is recognized.
Only on 6.8.x is it not recognized.

eero-t · 2024-02-29T11:00:09Z

Yes, it works with 6.7 (drm-tip) kernel also for me, just not with 6.8 (i915 KMD).

EDIT: that was with public Xe KMD repo, not drm-tip. With drm-tip, the issue is already with earlier kernel version (see below).

ionutnechita-intel · 2024-02-29T16:19:34Z

I tested with 6.8.0-rc1 (6.8.0-060800rc1-generic) and the issue reproduces.

Maybe this issue appeared between 6.7 and 6.8.0-rc1.

I noticed several commits in 6.8.0-rc1 adding the new Intel Xe driver and fixing eDP/DisplayPort.

I do not have time to bisect to find which commit or commits cause this behaviour.

eero-t · 2024-02-29T18:39:08Z

Dang. I was comparing "drm-tip" on TGL against "xe-drm-next" kernel on DG1, but their i915 KMD codes seem to progress at different rates, so I had to do quick bisection using already existing nightly "drm-tip" builds...

While things work still with 6.7 version of "xe-drm-next" kernel repo, with the "drm-tip" repo kernel, clinfo & zello_sysman actually broke already earlier, somewhere between couple of "drm-tip" repo upstream 6.6-rc7 kernel integration changes:

drm-tip: 2023y-10m-29d-09h-52m-45s UTC integration manifest
drm-tip: 2023y-10m-31d-13h-47m-12s UTC integration manifest

(Commits named like those, or the original commits are not any more in "drm-tip" repo, as it gets constantly rebased to upstream, so I cannot provide list of commits between them any more.)

JablonskiMateusz · 2024-03-01T07:46:47Z

Hi folks,
we also observe an issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE. As a workaround, could you try running the application with these additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48?

eero-t · 2024-03-01T08:59:39Z

we also observe an issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE.

Media and 3D drivers seem to work fine with that change, so why is it a problem for the L0/compute stack?

(I'm wondering whether this change should be reported to upstream as kernel stable ABI breakage...)

Looking at the compute-runtime code, it seems to affect SVM capability & address space size:
https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/linux/product_helper_drm.cpp#L128

Whereas in Mesa code:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/vulkan/anv_device.c#L2300

eero-t · 2024-03-01T09:37:21Z

As a workaround, could you try running the application with these additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48?

Yes, with those, both clinfo & zello_sysman work just fine (on TGL-H iGPU).

ionutnechita-intel · 2024-03-01T10:13:52Z

Hi @eero-t,

Using the latest drm-tip version with the variables set in the environment, the GPU appears.

# /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
# NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [23.13.026032]
[opencl:cpu:2] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:3] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.26032]
# uname -a
Linux 6.8.0-rc6-lowlatency1 #1 SMP PREEMPT_DYNAMIC Fri Mar  1 09:38:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# lscpu | grep "Model name"
Model name:                         11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz

ionutnechita-intel · 2024-03-01T10:26:43Z

In this case, is the issue in the kernel or in the NEO driver/OpenCL?

eero-t · 2024-03-01T10:36:58Z

Well, it depends on whether the GTT size value returned by the KMD is considered part of the stable ABI, but I do not see how it could be, as there can be different reasons for those values to differ. I would think that NEO should accept / adapt to sensible GTT size values, potentially with a warning when a value differs from the expected one, instead of barfing out when it does not exactly match its expectations.
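
The kind of tolerant check suggested here could look roughly like the Python sketch below. It is purely illustrative and is not how compute-runtime is implemented or how the issue was eventually fixed: the function name, the set of supported widths, and the example value with a small shortfall are all assumptions made for this sketch.

```python
import warnings


def address_bits_from_gtt_limit(gpu_address_space: int, supported_bits=(47, 48)) -> int:
    """Return the address width in bits for a reported highest GPU address.

    Hypothetical tolerant check: any value whose top bit falls in a supported
    width is accepted, with a warning when it is below the full range.
    """
    if gpu_address_space <= 0:
        raise ValueError("reported address space must be positive")
    bits = gpu_address_space.bit_length()
    if bits not in supported_bits:
        raise ValueError(f"unsupported address width: {bits} bits")
    full_range = (1 << bits) - 1
    if gpu_address_space != full_range:
        # Accept the value but tell the user it is not the expected one.
        warnings.warn(
            f"reported limit is {full_range - gpu_address_space} below the "
            f"full {bits}-bit range; continuing with {bits} bits"
        )
    return bits
```

With this approach a full 48-bit value and a 48-bit value with a small shortfall would both be treated as 48-bit address spaces, and only the second would produce a warning.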

eero-t · 2024-03-04T17:53:46Z

Tested 6.8.0-rc3 based Xe KMD, and compute/Sysman driver worked with that, so this issue seems to be i915 KMD specific (as expected).

obj-obj · 2024-03-05T20:36:04Z

I can reproduce this on Arch

Disty0 · 2024-03-17T22:55:47Z

I can reproduce this on Arch with Linux 6.8 release (6.8.1-arch1-1) using i915.
Haven't tried xe yet.

Exporting these works fine:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
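
For anyone launching tools from a script rather than a shell, the same workaround can be applied per process. This is a minimal Python sketch: the helper names are made up for illustration, and only the two variable names and values come from this thread.

```python
import os
import subprocess

# Workaround variables quoted in this thread.
NEO_WORKAROUND_ENV = {"NEOReadDebugKeys": "1", "OverrideGpuAddressSpace": "48"}


def with_neo_workaround(base_env=None):
    """Return a copy of the environment with the workaround variables added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(NEO_WORKAROUND_ENV)
    return env


def run_with_neo_workaround(command):
    """Run a command (list of arguments) with the workaround applied."""
    return subprocess.run(
        command,
        env=with_neo_workaround(),
        capture_output=True,
        text=True,
        check=False,
    )
```

For example, `run_with_neo_workaround(["clinfo"])` would run clinfo with both variables set, without changing the environment of the calling process.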

ionutnechita-intel · 2024-03-19T14:27:37Z

In this case, will the NEO compute driver be adapted to work with the new behaviour?

DX37 · 2024-03-20T15:02:52Z

Encountered this issue also.

Mstrodl · 2024-03-20T18:52:32Z

On 6.8:

gpuAddressSpace = 281474976706559
= 111111111111111111111111111111111110111111111111

On 6.7:

gpuAddressSpace = 281474976710655
= 111111111111111111111111111111111111111111111111

The issue seems to lie here:

compute-runtime/shared/source/memory_manager/gfx_partition.cpp

Lines 250 to 253 in 0307854

    
    if (cpuVirtualAddressSize == 48 && gpuAddressSpace == maxNBitValue(48)) {
        gfxBase = maxNBitValue(48 - 1) + 1;
        heapInit(HeapIndex::heapSvm, 0ull, gfxBase);
    } else if (gpuAddressSpace == maxNBitValue(47)) {
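
The arithmetic behind the two values above can be checked with a short Python sketch. `max_n_bit_value` is a stand-in written for illustration, assumed to behave like the `maxNBitValue` helper in the quoted snippet (that is, `(1 << n) - 1`), and `matches_exact_check` models only the two equality comparisons visible in the quoted lines, not the rest of the function.

```python
def max_n_bit_value(n: int) -> int:
    """Largest value representable in n bits (assumed equivalent of maxNBitValue)."""
    return (1 << n) - 1


# Values quoted in this comment.
GPU_ADDRESS_SPACE_6_7 = 281474976710655
GPU_ADDRESS_SPACE_6_8 = 281474976706559


def matches_exact_check(gpu_address_space: int) -> bool:
    """True if the value equals one of the two limits compared in the snippet."""
    return gpu_address_space in (max_n_bit_value(48), max_n_bit_value(47))


# How far the 6.8 value falls short of the full 48-bit range.
shortfall = GPU_ADDRESS_SPACE_6_7 - GPU_ADDRESS_SPACE_6_8

print(matches_exact_check(GPU_ADDRESS_SPACE_6_7))
print(matches_exact_check(GPU_ADDRESS_SPACE_6_8))
print(shortfall, hex(GPU_ADDRESS_SPACE_6_8))
```

The value reported on 6.8 is 4096 lower than the full 48-bit range (bit 12 is cleared), so neither of the two exact comparisons matches it.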

eero-t · 2024-03-22T09:48:52Z

In this case, will the NEO compute driver be adapted to work with the new behaviour?

It seems that change in value reported by the GTT size ioctl() may be reverted in i915 kernel driver: https://patchwork.freedesktop.org/series/131095/

(I.e. KMD would only internally use the "usable" GTT size value, and report full address space to user space, including the reserved parts, and distros using 6.8.0 kernel need to patch their kernels until upstream releases updated kernel.)

@JablonskiMateusz Maybe compute-runtime could do some BAT tests also with latest drm-tip kernel, to catch such changes before they are sent to upstream kernel? This change was in drm-tip repo i915 KMD already in 6.7...

nyanmisaka · 2024-03-25T11:47:54Z

Note that the upcoming Ubuntu 24.04 LTS uses the non-LTS 6.8 kernel. Hopefully it can be fixed before it's released next month. Otherwise OpenCL will not be available on many distros based on it.

ionutnechita-intel · 2024-03-25T12:05:09Z

Thanks

obj-obj · 2024-03-25T17:09:04Z

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

nyanmisaka · 2024-03-25T17:52:54Z

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

rusticl is still an experimental implementation and according to Mesa it is currently broken on Arc GPUs. My use case is video processing and only NEO supports zero-copy interop between VA-API and OpenCL through cl_intel_va_api_media_sharing.

TimoVerbrugghe · 2024-03-25T23:38:19Z

Just adding as well that I'm also experiencing this issue on nixos when running the latest kernel (6.8.1). GPU (intel N100 alder lake) does not show up in clinfo.

However, on a N5105 machine (Jasper Lake), the GPU did get detected by clinfo on the latest kernel.

However downgrading to 6.7.10 on the N100 machine immediately resolved the issue.

eero-t · 2024-04-15T14:10:20Z

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.

@Disty0 If issue happens also with 6.6 kernel, I do not think it to be related to this issue => please file a separate one, and report also compute-runtime version, and where perf reports CPU usage to happen (run as root):

# perf record -a
<wait a min or two>
^C
# perf report -n

eero-t · 2024-04-15T15:07:43Z

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

Slightly newer tag includes actual fix:
24.09.28717.12...24.09.28717.14
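
To check whether an installed compute-runtime version is at least the tag mentioned above, a simple comparison can be sketched in Python. It assumes that version strings are dot-separated integers that can be compared part by part, which is an assumption of this sketch rather than a documented versioning rule; the function names are also made up for illustration.

```python
def parse_version(text: str) -> tuple:
    """Split a dotted version such as '24.09.28717.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# First tag reported in this thread to include the actual fix.
FIRST_TAG_WITH_FIX = parse_version("24.09.28717.14")


def has_gtt_fix(installed: str) -> bool:
    """True if the installed version is at or after the first fixed tag."""
    return parse_version(installed) >= FIRST_TAG_WITH_FIX
```

A `False` result is not conclusive, because a distribution may have applied the fix on top of an older version.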

chao-camect · 2024-04-17T07:50:08Z

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

Slightly new tag includes actual fix: 24.09.28717.12...24.09.28717.14

Right. I was trying to see why 24.09.28717.12 still didn't work for me and read your reply.
Thanks. This saved me time.

tjaalton · 2024-04-17T08:02:32Z

I applied this commit on top of the version currently shipped by Arch Linux (23.48.27912.11) and it fixed the problem with my i5-7200U iGPU, now clinfo is able to detect it and I could successfully run some admittedly simple OpenCL programs on Linux 6.8.2 (without any extra environment variables).

FYI: @tjaalton Ubuntu 24.04 LTS is also having a 6.8+ kernel, so its compute-runtime packages needs this too.

uploaded the fix to noble, thanks for the ping

Disty0 · 2024-04-21T14:15:41Z

This issue seems to be fixed with aur/intel-compute-runtime-bin 24.13.29138.7-1 on my end. (Arch Linux 6.8.4)

JablonskiMateusz · 2024-04-24T07:55:57Z

Since the issue seems to be fixed, can we now close it?

ionutnechita-intel · 2024-04-24T07:57:54Z

Hello @JablonskiMateusz ,

I think this issue is fixed now.

It should be fine to close this ticket.

simonlui · 2024-05-01T07:43:22Z

@ionutnechita-intel Sorry, but this doesn't work inside an OCI container with podman for whatever reason. Not sure if it is also an issue with Docker but I would presume it would be a problem as well. You have to export the two environment variables NEOReadDebugKeys=1 and OverrideGpuAddressSpace=48 for the GPU to be seen inside the container but not on the host machine. I don't know if you want to consider it the same bug but if not, I can open a new bug report for this.

joanbm · 2024-05-01T18:19:32Z

@simonlui Are you sure that the version of the Intel Compute Runtime installed inside the container contains the fix? I can imagine your situation happening if this were not the case. For reference, my iGPU appears to be correctly detected by clinfo inside an Arch Linux-based container.

simonlui · 2024-05-02T06:19:59Z

@joanbm Yeah that was it. I was confused why I was hitting this in the oneapi-basekit Docker image but it was last updated a month ago at the time of writing this so it makes sense why it still had the issue without the updated version of the runtime inside the container.

mattcurf · 2024-05-04T17:49:50Z

@JablonskiMateusz When will this fix be posted to the apt repo at https://repositories.intel.com/gpu/ubuntu?

ionutnechita-intel · 2024-05-08T06:07:20Z

Hi @simonlui,

I understand what you are saying, but it must be checked more thoroughly, with several OS variants as containers.

I tested it on Ubuntu 24.04, directly on the physical machine, with the latest update, and I didn't see the problem anymore.

simonlui · 2024-05-08T07:41:54Z

@ionutnechita-intel The problem was fixed, it was an outdated compute runtime package inside the oneapi-basekit Docker image which didn't have the updated runtime installed by default. Updating the package manually fixed the issue.

ionutnechita-intel · 2024-05-09T07:25:11Z

Hi @simonlui,

Thank you for the feedback.

Have a good day.

ionutnechita-intel changed the title ~~NEO driver is not detect for GPU when using kernel 6.8.x.~~ NEO driver not detect GPU when using kernel 6.8.x. Feb 29, 2024

JablonskiMateusz added the in queue label Mar 4, 2024

TimoVerbrugghe added a commit to TimoVerbrugghe/homelab-monorepo that referenced this issue Mar 25, 2024

Moving to 6.7 kernel due to issue with intel-compute-runtime on kernel 6.8 - intel/compute-runtime#710

b3e3480

TimoVerbrugghe mentioned this issue Mar 25, 2024

oneVPL for intel gpus NixOS/nixpkgs#264621

Merged

13 tasks

AdamCetnerowski removed the in queue label Mar 26, 2024

JablonskiMateusz mentioned this issue Mar 26, 2024

Arc GPU not showing up in clinfo nor sycl-ls on Linux #714

Closed

ghost mentioned this issue Mar 28, 2024

Can't detect GPU devices intel/intel-extension-for-pytorch#538

Open

eero-t mentioned this issue Apr 16, 2024

How to verify GPU driver is properly installed on Meteor Lake Platform #720

Closed

This was referenced Apr 16, 2024

OpenCL not working in most recent releases/pre-releases thor2002ro/unraid_kernel#28

Closed

[BUG] fyi | latest version of opencl-intel is unable to detect gpu's on linux kernels newer than 6.7.x linuxserver/docker-mods#878

Closed

nyanmisaka added a commit to nyanmisaka/jellyfin-packaging that referenced this issue Apr 19, 2024

Bump Intel Compute Runtime to 24.13.29138.7

928b1b3

Fixes an upstream issue in the last version intel/compute-runtime#710

nyanmisaka mentioned this issue Apr 19, 2024

Bump Intel Compute Runtime to 24.13.29138.7 jellyfin/jellyfin-packaging#15

Merged

notsyncing mentioned this issue Apr 21, 2024

clpeak and llama.cpp stuck at 100% CPU on 6.8.5 kernel #726

Closed

kwaa mentioned this issue Apr 23, 2024

[Feature Request] IPEX-LLM + Axolotl Docker Image intel-analytics/ipex-llm#10821

Closed

melvyn2 mentioned this issue Apr 23, 2024

intel-compute-runtime: 24.09.28717.12 -> 24.13.29138.7 NixOS/nixpkgs#305374

Merged

JablonskiMateusz added the merged change was merged label Apr 24, 2024

JablonskiMateusz closed this as completed Apr 24, 2024

melvyn2 mentioned this issue Apr 25, 2024

[Backport release-23.11] level-zero: 1.15.1 -> 1.16.4, intel-graphics-compiler: 1.0.14828.8 -> 1.0.16238.4, intel-compute-runtime: 23.30.26918.20 -> 24.13.29138.7 NixOS/nixpkgs#306833

Closed

13 tasks

simonlui mentioned this issue May 3, 2024

Llama.cpp not working with intel ARC 770? ggerganov/llama.cpp#7042

Closed

mattcurf mentioned this issue May 5, 2024

Does not find GPU when using Kernel 6.8 with Ubuntu 24.04 mattcurf/ollama-intel-gpu#1

Closed

This was referenced May 16, 2024

immich-machine-learning openvino not working (transcoding works) immich-app/immich#9523

Closed

fix(ml): openvino not working with kernel 6.7.5 or later immich-app/immich#9541

Merged

doucej mentioned this issue Jun 9, 2024

Ollama Linux seg fault with GPU on Ubuntu 22.04 intel-analytics/ipex-llm#11269

Open
