NEO driver not detect GPU when using kernel 6.8.x. #710

Closed
ionutnechita-intel opened this issue Feb 27, 2024 · 44 comments
Labels
merged change was merged

Comments

@ionutnechita-intel

ionutnechita-intel commented Feb 27, 2024

The NEO driver does not detect the GPU when using kernel 6.8.x.

With kernels 6.5.x and 6.6.x, the GPU is present:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [24.05.28454.6]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28454]

On kernel 6.8.x, the output is:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
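
The difference between the two listings above is that the 6.8.x one has no gpu entries at all. A minimal Python sketch for checking this automatically, assuming every device line starts with a `[backend:type:index]` tag as in the output shown (the helper name `gpu_entries` is hypothetical, not part of any oneAPI tool):

```python
import re

# Leading "[backend:device_type:index]" tag, as seen in the sycl-ls output above.
TAG = re.compile(r"^\[(?P<backend>\w+):(?P<kind>\w+):(?P<index>\d+)\]")


def gpu_entries(sycl_ls_output: str) -> list[str]:
    """Return the sycl-ls lines whose device type is 'gpu'."""
    hits = []
    for line in sycl_ls_output.splitlines():
        stripped = line.strip()
        match = TAG.match(stripped)
        if match and match.group("kind") == "gpu":
            hits.append(stripped)
    return hits
```

An empty result for output captured on the affected kernel would correspond to the second listing; this only inspects text and says nothing about why the device is missing.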
@eero-t

eero-t commented Feb 28, 2024

I can reproduce this with the latest drm-tip 6.8.0-rc6 kernel, using either an earlier build (2024-02-09) of the compute-runtime master branch or earlier compute-runtime releases => neither clinfo nor zello_sysman recognizes the GPU. The vainfo / vpl-inspect media tools still recognize the GPU though, so it's a compute-stack-specific issue.

I do not see any difference in strace output (between old and new kernels) before compute-runtime decides to give up, so it's a bit of a mystery why it decides not to recognize the GPU.

@ionutnechita-intel
Author

Thank you for reproducing this.

On 6.7.x, the GPU is recognized.
Only on 6.8.x is it not recognized.

@eero-t

eero-t commented Feb 29, 2024

Yes, it works with 6.7 (drm-tip) kernel also for me, just not with 6.8 (i915 KMD).

EDIT: that was with public Xe KMD repo, not drm-tip. With drm-tip, the issue is already with earlier kernel version (see below).

@ionutnechita-intel
Author

I tested with 6.8.0-rc1 (6.8.0-060800rc1-generic) and the issue reproduces.

So the issue probably appeared between 6.7 and 6.8.0-rc1.

I noticed several commits in 6.8.0-rc1 adding the new Intel Xe driver and fixing eDP/DisplayPort.

I do not have time to bisect to find which commit(s) cause this behaviour.

@ionutnechita-intel ionutnechita-intel changed the title NEO driver is not detect for GPU when using kernel 6.8.x. NEO driver not detect GPU when using kernel 6.8.x. Feb 29, 2024
@eero-t

eero-t commented Feb 29, 2024

Dang. I was comparing "drm-tip" on TGL against the "xe-drm-next" kernel on DG1, but their i915 KMD code seems to progress at different rates, so I had to do a quick bisection using already existing nightly "drm-tip" builds...

While things still work with the 6.7 version of the "xe-drm-next" kernel repo, with the "drm-tip" repo kernel, clinfo & zello_sysman actually broke earlier, somewhere between a couple of "drm-tip" repo upstream 6.6-rc7 kernel integration changes:

  • drm-tip: 2023y-10m-29d-09h-52m-45s UTC integration manifest
  • drm-tip: 2023y-10m-31d-13h-47m-12s UTC integration manifest

(Commits named like those, and the original commits, are no longer in the "drm-tip" repo, as it gets constantly rebased onto upstream, so I can no longer provide a list of the commits between them.)

@JablonskiMateusz
Contributor

Hi folks,
we also observe an issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE. As a workaround, could you try running the application with two additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 ?

@eero-t

eero-t commented Mar 1, 2024

we also observe issue with 6.8 kernel - i915 reports different I915_CONTEXT_PARAM_GTT_SIZE.

Media and 3D drivers seem to work fine with that change, so why is it a problem for the L0/compute stack?

(I'm wondering whether this change should be reported to upstream as kernel stable ABI breakage...)

Looking at the compute-runtime code, it seems to affect SVM capability & address space size:
https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/linux/product_helper_drm.cpp#L128

Whereas in Mesa code:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/vulkan/anv_device.c#L2300

@eero-t

eero-t commented Mar 1, 2024

As a workaround could you try to run application with additional env - NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 ?

Yes, with those both clinfo & zello_sysman work just fine (on TGL-H iGPU).

@ionutnechita-intel
Author

Hi @eero-t,

Using the latest drm-tip version with the variables set in the environment, the GPU appears.

# /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
# NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [23.13.026032]
[opencl:cpu:2] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:3] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.26032]
# uname -a
Linux 6.8.0-rc6-lowlatency1 #1 SMP PREEMPT_DYNAMIC Fri Mar  1 09:38:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# lscpu | grep "Model name"
Model name:                         11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz

@ionutnechita-intel
Author

In this case, does the issue come from the kernel or from the NEO driver/OpenCL?

@eero-t

eero-t commented Mar 1, 2024

Well, it depends on whether the GTT size value returned by the KMD is considered part of the stable ABI, but I do not see how it could be, as there can be different reasons for those values to differ. I would think that NEO should accept / adapt to sensible GTT size values, potentially with a warning when the value differs from the expected one, instead of barfing when it does not exactly match its expectations.

@eero-t

eero-t commented Mar 4, 2024

Tested 6.8.0-rc3 based Xe KMD, and compute/Sysman driver worked with that, so this issue seems to be i915 KMD specific (as expected).

@obj-obj

obj-obj commented Mar 5, 2024

I can reproduce this on Arch

@Disty0

Disty0 commented Mar 17, 2024

I can reproduce this on Arch with Linux 6.8 release (6.8.1-arch1-1) using i915.
Haven't tried xe yet.

Exporting these works fine:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
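
For launching a single program with the workaround instead of exporting the variables shell-wide, a minimal Python sketch could look like the following. The two variable names and values come from this thread; the helper names `env_with_workaround` and `run_with_workaround` are hypothetical.

```python
import os
import subprocess

# Workaround variables suggested earlier in this thread.
WORKAROUND_ENV = {"NEOReadDebugKeys": "1", "OverrideGpuAddressSpace": "48"}


def env_with_workaround(base=None):
    """Return a copy of the environment with the workaround variables added."""
    env = dict(os.environ if base is None else base)
    env.update(WORKAROUND_ENV)
    return env


def run_with_workaround(cmd):
    """Run cmd (a list of arguments) with the workaround variables set."""
    return subprocess.run(
        cmd,
        env=env_with_workaround(),
        capture_output=True,
        text=True,
        check=False,
    )
```

For example, `run_with_workaround(["clinfo"]).stdout` would hold the clinfo output, assuming clinfo is installed and on the PATH. The parent process environment is left untouched.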

@ionutnechita-intel
Author

In this case, will the NEO compute driver be adapted to work with the new behaviour?

@DX37

DX37 commented Mar 20, 2024

Encountered this issue also.

@Mstrodl

Mstrodl commented Mar 20, 2024

On 6.8:

gpuAddressSpace = 281474976706559
= 111111111111111111111111111111111110111111111111

On 6.7:

gpuAddressSpace = 281474976710655
= 111111111111111111111111111111111111111111111111

The issue seems to lie here:

if (cpuVirtualAddressSize == 48 && gpuAddressSpace == maxNBitValue(48)) {
    gfxBase = maxNBitValue(48 - 1) + 1;
    heapInit(HeapIndex::heapSvm, 0ull, gfxBase);
} else if (gpuAddressSpace == maxNBitValue(47)) {
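
The arithmetic behind the two values above can be checked with a short Python sketch. It assumes `maxNBitValue(n)` means "all n low bits set" (modelled here by a hypothetical `max_n_bit_value`), and it mirrors only the two equality comparisons visible in the truncated excerpt; it ignores the `cpuVirtualAddressSize` condition and any branches not shown.

```python
def max_n_bit_value(n: int) -> int:
    """Largest value representable in n bits (assumed meaning of maxNBitValue)."""
    return (1 << n) - 1


GTT_SIZE_6_7 = 281474976710655  # value reported on 6.7, per the comment above
GTT_SIZE_6_8 = 281474976706559  # value reported on 6.8, per the comment above


def matches_expected_size(gpu_address_space: int) -> bool:
    """Mirror only the two equality checks visible in the excerpt."""
    return gpu_address_space in (max_n_bit_value(48), max_n_bit_value(47))


print(GTT_SIZE_6_7 - GTT_SIZE_6_8)          # 4096
print(matches_expected_size(GTT_SIZE_6_7))  # True
print(matches_expected_size(GTT_SIZE_6_8))  # False
```

The 6.8 value is exactly 4096 bytes short of the full 48-bit value, which is the single cleared bit in the binary form above, so neither exact comparison in the excerpt matches it. That is consistent with `OverrideGpuAddressSpace=48` working as a workaround.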

@eero-t

eero-t commented Mar 22, 2024

In this case, will the NEO compute driver be adapted to work with the new behaviour?

It seems that the change in the value reported by the GTT size ioctl() may be reverted in the i915 kernel driver: https://patchwork.freedesktop.org/series/131095/

(I.e. KMD would only internally use the "usable" GTT size value, and report full address space to user space, including the reserved parts, and distros using 6.8.0 kernel need to patch their kernels until upstream releases updated kernel.)

@JablonskiMateusz Maybe compute-runtime could do some BAT tests also with latest drm-tip kernel, to catch such changes before they are sent to upstream kernel? This change was in drm-tip repo i915 KMD already in 6.7...

@nyanmisaka
Contributor

Note that the upcoming Ubuntu 24.04 LTS uses the non-LTS 6.8 kernel. Hopefully it can be fixed before it's released next month. Otherwise OpenCL will not be available on many distros based on it.

@ionutnechita-intel
Author

Thanks

@obj-obj

obj-obj commented Mar 25, 2024

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

@nyanmisaka
Contributor

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

rusticl is still an experimental implementation and according to Mesa it is currently broken on Arc GPUs. My use case is video processing and only NEO supports zero-copy interop between VA-API and OpenCL through cl_intel_va_api_media_sharing.

@TimoVerbrugghe

TimoVerbrugghe commented Mar 25, 2024

Just adding that I'm also experiencing this issue on NixOS when running the latest kernel (6.8.1). The GPU (Intel N100, Alder Lake) does not show up in clinfo.

On an N5105 machine (Jasper Lake), however, the GPU did get detected by clinfo on the latest kernel.

Downgrading to 6.7.10 on the N100 machine immediately resolved the issue.

@eero-t

eero-t commented Apr 15, 2024

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.

@Disty0 If the issue also happens with a 6.6 kernel, I do not think it is related to this issue => please file a separate one, and also report the compute-runtime version and where perf reports the CPU usage to be (run as root):

# perf record -a
<wait a min or two>
^C
# perf report -n

@eero-t

eero-t commented Apr 15, 2024

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

Slightly newer tag includes actual fix:
24.09.28717.12...24.09.28717.14
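
To compare an installed compute-runtime version against the tag mentioned above, a minimal Python sketch can compare the dotted version strings numerically. It assumes versions are dot-separated integers, as in the versions quoted in this thread; the helper names are hypothetical. A result of False does not prove the fix is absent, since a distro package may carry the fix as a backported patch on an older version.

```python
FIXED_TAG = "24.09.28717.14"  # tag that includes the fix, per the comment above


def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def at_least_fixed_tag(version: str, fixed: str = FIXED_TAG) -> bool:
    """True if version is numerically equal to or newer than the fixed tag."""
    return parse_version(version) >= parse_version(fixed)


print(at_least_fixed_tag("24.09.28717.12"))  # False
print(at_least_fixed_tag("24.09.28717.14"))  # True
```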

@chao-camect

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

Slightly new tag includes actual fix: 24.09.28717.12...24.09.28717.14

Right. I was trying to see why 24.09.28717.12 still didn't work for me and read your reply.
Thanks. This saved me time.

@tjaalton
Contributor

I applied this commit on top of the version currently shipped by Arch Linux (23.48.27912.11) and it fixed the problem with my i5-7200U iGPU, now clinfo is able to detect it and I could successfully run some admittedly simple OpenCL programs on Linux 6.8.2 (without any extra environment variables).

FYI: @tjaalton Ubuntu 24.04 LTS is also having a 6.8+ kernel, so its compute-runtime packages needs this too.

uploaded the fix to noble, thanks for the ping

@Disty0

Disty0 commented Apr 21, 2024

This issue seems to be fixed with aur/intel-compute-runtime-bin 24.13.29138.7-1 on my end. (Arch Linux 6.8.4)

@JablonskiMateusz
Contributor

Since the issue seems to be fixed, can we now close it?

@JablonskiMateusz JablonskiMateusz added the merged change was merged label Apr 24, 2024
@ionutnechita-intel
Author

Hello @JablonskiMateusz ,

I think this issue is fixed now.

It should be fine to close this ticket.

@simonlui

simonlui commented May 1, 2024

@ionutnechita-intel Sorry, but this doesn't work inside an OCI container with podman for whatever reason. Not sure if it is also an issue with Docker but I would presume it would be a problem as well. You have to export the two environment variables NEOReadDebugKeys=1 and OverrideGpuAddressSpace=48 for the GPU to be seen inside the container but not on the host machine. I don't know if you want to consider it the same bug but if not, I can open a new bug report for this.

@joanbm

joanbm commented May 1, 2024

@simonlui Are you sure that the version of the Intel Compute Runtime installed inside the container contains the fix? I can imagine your situation happening if this were not the case. For reference, my iGPU appears to be correctly detected by clinfo inside an Arch Linux-based container.

@simonlui

simonlui commented May 2, 2024

@joanbm Yeah that was it. I was confused why I was hitting this in the oneapi-basekit Docker image but it was last updated a month ago at the time of writing this so it makes sense why it still had the issue without the updated version of the runtime inside the container.

@mattcurf

mattcurf commented May 4, 2024

@JablonskiMateusz When will this fix be posted to the apt repo at https://repositories.intel.com/gpu/ubuntu?

@ionutnechita-intel
Author

Hi @simonlui,

I understand what you are saying, but it needs to be checked more thoroughly, with several OS variants as containers.

I tested it on Ubuntu 24.04, directly on the physical machine, with the latest update, and I didn't see the problem anymore.

@simonlui

simonlui commented May 8, 2024

@ionutnechita-intel The problem was fixed, it was an outdated compute runtime package inside the oneapi-basekit Docker image which didn't have the updated runtime installed by default. Updating the package manually fixed the issue.

@ionutnechita-intel
Author

Hi @simonlui,

Thank you for the feedback.

Have a good day.
