Slow Kernel Launch Times #600

Open
FreddieWitherden opened this issue Jan 9, 2023 · 21 comments

Comments

@FreddieWitherden

FreddieWitherden commented Jan 9, 2023

Using clpeak with runtime 22.43.24595 on the integrated GPU of an i7-12700H CPU under Linux, I find the kernel launch latency to be 42.46 us. This is around 10 times higher than can be expected for a recent discrete GPU from AMD/NVIDIA connected over PCIe. In general, one would expect integrated GPUs to have an advantage here since the CPU and GPU both share an LLC.

@MichalMrozek
Contributor

Integrated GPUs go through the KMD and GuC for submissions, hence the times you observe are within the expected range.

@FreddieWitherden
Author

FreddieWitherden commented Jan 10, 2023

Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running.
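
As a rough illustration of what "memcpy + atomic" means here: the application copies a command packet into a ring buffer that the GPU can see, then publishes a new write index that the GPU polls, with no system call in between. The sketch below only simulates that idea in plain Python. The class name, packet format, and ring size are all hypothetical, and it is not any vendor's actual submission interface.

```python
import threading


class ToyRing:
    """Hypothetical user-mapped command ring; simulates the idea only."""

    def __init__(self, slots=8, slot_size=16):
        self.slots = slots
        self.slot_size = slot_size
        self.buf = bytearray(slots * slot_size)  # stands in for GPU-visible memory
        self.write_idx = 0                       # the index the GPU would poll
        self.read_idx = 0                        # advanced by the consumer
        self._lock = threading.Lock()            # stands in for an atomic store

    def submit(self, packet: bytes) -> int:
        """Copy one packet into the ring and publish it."""
        if len(packet) > self.slot_size:
            raise ValueError("packet too large for a slot")
        if self.write_idx - self.read_idx >= self.slots:
            raise BufferError("ring full")
        off = (self.write_idx % self.slots) * self.slot_size
        # Step 1: the "memcpy" into the ring (zero-padded to a full slot).
        self.buf[off:off + self.slot_size] = packet.ljust(self.slot_size, b"\0")
        # Step 2: the "atomic": make the packet visible by bumping the index.
        with self._lock:
            self.write_idx += 1
            return self.write_idx

    def consume(self) -> bytes:
        """What the GPU front end would do: read the next published slot."""
        if self.read_idx == self.write_idx:
            raise BufferError("ring empty")
        off = (self.read_idx % self.slots) * self.slot_size
        data = bytes(self.buf[off:off + self.slot_size])
        self.read_idx += 1
        return data.rstrip(b"\0")
```

By contrast, a kernel-mediated path adds at least one system call and a scheduler hand-off per submission, which is where the extra microseconds discussed in this thread would come from.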

For an A770M I've observed launch times on the order of ~15 us on the same system; better, but still over three times higher than what I'm used to seeing.

@MichalMrozek
Contributor

On the A770M, launch time should be around 8 us.
What is your operating system?

"Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running."

It is not a hardware limitation; integrated parts are also capable of direct submission. The code is not ready to be enabled, though: it requires VM_BIND to be enabled in the Linux kernel, which is not the case for integrated parts.

@FreddieWitherden
Author

On the A770M, launch time should be around 8 us. What is your operating system?

Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, but I currently cannot get OpenCL to work on my A770 (although it works fine as a display adapter).

@eero-t

eero-t commented Jan 16, 2023

Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, but I currently cannot get OpenCL to work on my A770 (although it works fine as a display adapter).

The best Intel dGPU support is currently in the backport kernels: https://github.com/intel-gpu/intel-gpu-i915-backports

Binary packages for those are available from the Intel repository: https://dgpu-docs.intel.com/installation-guides/index.html

The Intel dGPU support enabled in the upstream v6.2(-rc1) Linux kernel still lags somewhat behind those.

Older upstream kernel versions are missing even more features, and they also need the force-probe option just to recognize Intel dGPUs.

@FreddieWitherden
Author

FreddieWitherden commented Feb 20, 2023

Having upgraded to 6.2 I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43 us and the i7-12700H is still there at ~45 us.

These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases where the integrated GPU is outperforming my A770M despite it having a huge advantage in execution resources and bandwidth.
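
A back-of-the-envelope model of how this can happen: if each kernel is launched only after the previous one has finished, wall time is roughly the number of launches times (launch latency + execution time). The launch latencies below are the clpeak figures quoted above. The per-kernel execution times and the 5x speedup are made-up values chosen purely for illustration, and the model ignores any overlap between queued launches.

```python
IGPU_LATENCY_US = 45.0    # clpeak figure quoted above (integrated GPU)
DGPU_LATENCY_US = 99.43   # clpeak figure quoted above (A770M)


def total_time_us(n_launches, launch_latency_us, exec_time_us):
    """Serial launch model: every launch pays full latency plus execution."""
    return n_launches * (launch_latency_us + exec_time_us)


def breakeven_exec_time_us(speedup):
    """iGPU per-kernel execution time above which the dGPU wins.

    Assumes the dGPU executes each kernel `speedup` times faster (speedup > 1).
    Derived from: L_igpu + t = L_dgpu + t / speedup.
    """
    return (DGPU_LATENCY_US - IGPU_LATENCY_US) / (1.0 - 1.0 / speedup)


# Hypothetical workload: 10,000 small kernels, dGPU executes each 5x faster.
igpu = total_time_us(10_000, IGPU_LATENCY_US, exec_time_us=50.0)
dgpu = total_time_us(10_000, DGPU_LATENCY_US, exec_time_us=10.0)
print(f"iGPU: {igpu / 1e6:.3f} s, dGPU: {dgpu / 1e6:.3f} s")
print(f"break-even iGPU kernel time at 5x: {breakeven_exec_time_us(5.0):.1f} us")
```

Under these assumptions the integrated GPU finishes first even though each kernel runs five times slower on it, because the fixed launch cost dominates short kernels.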

@eero-t

eero-t commented Feb 21, 2023

Having upgraded to 6.2 I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43 us and the i7-12700H is still there at ~45 us.

Is this startup timing for a cold start (single run), or an average of warm ones (tight loop of runs)?

If you run some other (lightweight) workload on the same GPU in the background (so that the GPU frequency is up when your test starts), does the dGPU perform better than the iGPU?

These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases where the integrated GPU is outperforming my A770M despite it having a huge advantage in execution resources and bandwidth.

Please check with intel_gpu_top (from your distro's intel-gpu-tools package) what frequencies both of these GPUs are running at during your test cases. Your workloads may be so lightweight for the dGPU that it runs at a low frequency, but heavy enough to keep the iGPU at a high frequency.

If this is the case, the issue is kernel / FW power management rather than compute driver / GPU job submission.

@FreddieWitherden
Author

So clpeak does appear to try and fully saturate the GPU when determining launch latency and uses multiple iterations:

https://github.com/krrishnarraj/clpeak/blob/master/src/kernel_latency.cpp#L11

and the results I get are reproducible.
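
For readers unfamiliar with how such a figure can be derived: one common approach with OpenCL event profiling is to take, for each launch, the difference between the `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_QUEUED` timestamps (both reported in nanoseconds) and average it over the iterations. The sketch below assumes that definition and uses made-up timestamps. It does not call OpenCL and is not clpeak's code.

```python
def mean_launch_latency_us(events):
    """Average (start - queued) over launches, converted from ns to us.

    events: iterable of (queued_ns, start_ns) pairs, one per kernel launch.
    """
    deltas = [start - queued for queued, start in events]
    if not deltas:
        raise ValueError("no events")
    if any(d < 0 for d in deltas):
        raise ValueError("start timestamp precedes queued timestamp")
    return sum(deltas) / len(deltas) / 1000.0


# Made-up timestamps (ns) for three launches.
samples = [(1_000, 43_000), (100_000, 143_000), (200_000, 242_000)]
print(f"{mean_launch_latency_us(samples):.2f} us")
```

Averaging over many iterations smooths out one-off costs such as the first (cold) launch, which is why the cold versus warm question above matters when comparing numbers.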

As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).

@eero-t

eero-t commented Feb 27, 2023

Could you give utilization & freq values for both GPUs whose latency you are comparing?

You can select which card is shown with the -d option, like this: intel_gpu_top -d drm:/dev/dri/card1.
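
On a machine with both an iGPU and a dGPU there will be more than one card node. A small helper, sketched here under the assumption that the DRM nodes live at `/dev/dri/card*`, builds one `intel_gpu_top -d drm:<node>` command line per node. The function name is hypothetical, and which node corresponds to which GPU depends on the system.

```python
import glob


def gpu_top_commands(card_nodes=None):
    """Build one `intel_gpu_top -d drm:<node>` command line per DRM card node."""
    if card_nodes is None:
        # Assumption: card nodes are exposed under /dev/dri on this system.
        card_nodes = sorted(glob.glob("/dev/dri/card*"))
    return [["intel_gpu_top", "-d", f"drm:{node}"] for node in card_nodes]


for cmd in gpu_top_commands():
    print(" ".join(cmd))
```

Each printed line can be run in its own terminal to watch both GPUs while the test case is running.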

@FreddieWitherden
Author

For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is fixed at ~40%, give or take a percent.

@eero-t

eero-t commented Feb 27, 2023

For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is fixed at ~40%, give or take a percent.

As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).

Thanks!

So although clpeak is not able to utilize the GPU fully (which is a separate problem), at least both are running at full speed, i.e. things are comparable.

PS. On iGPUs, Render/3D is the pipeline with the shader cores, which can act in 3D, compute or media mode, whereas Arc has a separate compute engine pipeline. If you build the latest IGT version from upstream, it shows that [unknown] is the Compute engine: https://gitlab.freedesktop.org/drm/igt-gpu-tools

@BartusW
Contributor

BartusW commented Apr 3, 2023

Hello FreddieWitherden and BA8F0D39

GitHub issue #600 is still open although the answers have already been given above. Let’s summarize this topic and move towards closure. For the lowest latency on the Arc/Flex dGFX family, please consider using the Intel OEM kernel, which can be obtained by following the instructions at https://dgpu-docs.intel.com/installation-guides/index.html. The latest version is available as a package at https://dgpu-docs.intel.com/releases/stable_555_20230124.html, and the source code to build it yourself is at https://github.com/intel-gpu/intel-gpu-i915-backports . The OEM kernel driver contains a significant change to the dispatch model, which allows the Compute-Runtime drivers to enable low-latency submission. Direct submission is not available in the upstream generic kernel, as mentioned for 6.1/6.2.

With the upstream 6.1/6.2 generic Linux kernel, the compute-runtime driver uses the legacy submission model, which is the same unified path for integrated and discrete Gfx devices. With legacy submission, longer kernel launch latencies are expected.

@FreddieWitherden
Author

Okay sounds good. Are there any plans to upstream the direct submission paths to the mainline Linux kernel?

@BartusW
Contributor

BartusW commented Apr 3, 2023

At this moment there is no plan to upstream the VM-Bind capability into the 6.2 generic kernel.

@BA8F0D39

BA8F0D39 commented Apr 3, 2023

@BartusW
The last stable release was on 2023-01-24.
Will there be an update?

@BartusW
Contributor

BartusW commented Apr 4, 2023

Are you asking about VM-Bind (the answer is covered above) or just asking in general about kernel launch latency changes?

@BA8F0D39

BA8F0D39 commented Apr 6, 2023

@BartusW
I mean, will the kernel packages at
https://repositories.intel.com/graphics/
be updated?

@nyanmisaka
Contributor

At this moment there is no plan to upstream the VM-Bind capability into the 6.2 generic kernel.

Will the new dispatch model be added into the upstream i915 and Xe KMD in the future? Or is it an OEM exclusive feature?

@eero-t

eero-t commented Jun 16, 2023

At this moment there is no plan to upstream the VM-Bind capability into the 6.2 generic kernel.

Will the new dispatch model be added into the upstream i915 and Xe KMD in the future? Or is it an OEM exclusive feature?

Xe kernel driver is based on VM_BIND. For more info, see:

@nyanmisaka
Contributor

I've read somewhere that there will be no VM_BIND in i915, which will speed up the upstreaming of Xe KMD.

It has become clear that we have a long way towards fully featured implementation of VM_BIND in i915.
Examples of the many challenges include integration with display, integration with userspace drivers,
a rewrite of all the i915 IGTs to support execbuf3, alignment with DRM GPU VA manager[1] etc.

We are stopping further VM_BIND upstreaming efforts in i915 so we can accelerate the merge plan
for the new drm/xe driver[2] which has been designed for VM_BIND from the beginning.

https://lists.freedesktop.org/archives/intel-gfx/2023-April/324237.html

@eero-t

eero-t commented Jun 16, 2023

(Disclaimer: I'm not a driver developer, just a spectator, so this is just my clueless observation.)

Making a major architectural change to the i915 kernel driver seems a practical impossibility to me, because it supports 1.5 decades of different Intel GPU HW, I think more than what's listed on this page: https://en.wikipedia.org/wiki/Intel_Graphics_Technology

That span covers multiple generations of user-space compute (OpenCL & SYCL), media, and 3D (OpenGL/GLES & Vulkan) drivers (maybe 10 different driver versions?), all of which would need to be tested and validated, some on very old / hard-to-get HW. After a large change like that, there would likely still be quite a few bugs that users would find only later on. And I do not see that the alternative, the kernel driver dropping support for anything older than what the last user-space driver generation supports (i.e. Broadwell and up), would be accepted by users either.
