Slow Kernel Launch Times #600

FreddieWitherden · 2023-01-09T21:39:33Z

Using clpeak with runtime 22.43.24595 on the integrated GPU on an i7-12700H CPU under Linux I find the kernel launch latency to be 42.46 us. This is around 10 times slower than can be expected for a recent discrete GPU from AMD/NVIDIA connected over PCIe. In general, one would expect integrated GPUs to have an advantage here since the CPU and GPU both share an LLC.

MichalMrozek · 2023-01-10T12:26:04Z

Integrated GPUs go through the KMD and GuC for submissions, hence the times you observe are within the expected range.

FreddieWitherden · 2023-01-10T13:30:30Z

Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running.

For an A770M I've observed launch times on the order of ~15 us on the same system; better, but still over three times higher than what I'm used to seeing.

MichalMrozek · 2023-01-11T11:54:49Z

On the A770M, launch time should be around 8 us.
What is your operating system?

"Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running."

It is not a hardware limitation; integrated parts are also capable of direct submission. The code is not ready to be enabled though: it requires VM_BIND to be enabled in the Linux kernel, which is not the case for integrated parts.

FreddieWitherden · 2023-01-11T13:46:31Z

On the A770M, launch time should be around 8 us. What is your operating system?

Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, although I currently cannot get OpenCL to work on my A770 (although it works fine as a display adapter).

eero-t · 2023-01-16T17:17:20Z

Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, although I currently cannot get OpenCL to work on my A770 (although it works fine as a display adapter).

Best Intel dGPU support is currently in backport kernels: https://github.com/intel-gpu/intel-gpu-i915-backports

Binary packages for those are available from Intel repository: https://dgpu-docs.intel.com/installation-guides/index.html

The Intel dGPU support enabled in the v6.2(-rc1) upstream Linux kernel is still lagging somewhat behind those.

Older upstream kernel versions are missing even more features, and also need the force-probe option even to recognize Intel dGPUs.

FreddieWitherden · 2023-02-20T20:13:09Z

Having upgraded to 6.2 I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43us and the i7-12700H is still there at ~45us.

These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases where the integrated GPU is outperforming my A770M despite it having a huge advantage in execution resources and bandwidth.
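As a rough illustration of why a launch-latency-sensitive workload can run faster on the integrated GPU, the sketch below uses a deliberately simplified model: every launch costs a fixed latency plus the kernel's execution time, and launches do not overlap. The two latencies are the clpeak figures quoted above; the execution times in the usage lines are hypothetical values chosen only for illustration, not measurements.

```python
# Simplified, non-overlapping launch model (an assumption, not how the
# driver actually schedules work): cost per launch = latency + execution.

IGPU_LATENCY_US = 45.0   # i7-12700H iGPU, approximate clpeak figure from the thread
DGPU_LATENCY_US = 99.43  # A770M on kernel 6.2, clpeak figure from the thread


def time_per_launch_us(latency_us, exec_us):
    """Cost of one launch under the simplified model."""
    return latency_us + exec_us


def break_even_saving_us(dgpu_latency_us=DGPU_LATENCY_US,
                         igpu_latency_us=IGPU_LATENCY_US):
    """Execution time the dGPU must save per kernel just to match the iGPU."""
    return dgpu_latency_us - igpu_latency_us


def dgpu_wins(igpu_exec_us, dgpu_exec_us):
    """True if the dGPU is faster per launch under the simplified model."""
    return (time_per_launch_us(DGPU_LATENCY_US, dgpu_exec_us)
            < time_per_launch_us(IGPU_LATENCY_US, igpu_exec_us))


if __name__ == "__main__":
    print(f"break-even saving: {break_even_saving_us():.2f} us per kernel")
    # Hypothetical execution times (us) for the same kernel on each device.
    for igpu_exec, dgpu_exec in [(100.0, 60.0), (100.0, 20.0)]:
        print(igpu_exec, dgpu_exec, dgpu_wins(igpu_exec, dgpu_exec))
```

Under this model the dGPU only comes out ahead once it saves more execution time per kernel than the difference between the two launch latencies, which short kernels may never do.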

eero-t · 2023-02-21T09:48:12Z

Having upgraded to 6.2 I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43us and the i7-12700H is still there at ~45us.

Is this startup timing for a cold start (single run), or an average of warm ones (tight loop of runs)?

If you run some other (lightweight) workload for the same GPU in the background (so that GPU frequency is up when your test starts), will dGPU perform better than iGPU?

These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases where the integrated GPU is outperforming my A770M despite it having a huge advantage in execution resources and bandwidth.

Please check with intel_gpu_top (from your distro's intel-gpu-tools package) at what frequencies both of these GPUs are running during your test cases. Your workloads may be so lightweight for the dGPU that it runs at a low frequency, but heavy enough to keep the iGPU at a high frequency.

If this is the case, the issue is kernel / FW power management rather than the compute driver / GPU job submission.
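For logging frequencies alongside a test run without watching intel_gpu_top, a minimal sketch is to read the i915 sysfs frequency files directly. It assumes the i915 driver exposes `gt_act_freq_mhz`, `gt_cur_freq_mhz` and `gt_max_freq_mhz` under `/sys/class/drm/<card>/`; which card number maps to which GPU is system-specific, and other drivers (such as Xe) use a different layout. The `read_gt_freqs` helper name is made up for this example.

```python
from pathlib import Path

# sysfs attribute names exposed by the i915 driver (assumed to be present).
FREQ_FILES = ("gt_act_freq_mhz", "gt_cur_freq_mhz", "gt_max_freq_mhz")


def read_gt_freqs(card="card0", sysfs_root="/sys/class/drm"):
    """Return {attribute: MHz or None} for one DRM card.

    A value of None means the file was missing or unreadable.
    """
    base = Path(sysfs_root) / card
    freqs = {}
    for name in FREQ_FILES:
        try:
            freqs[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            freqs[name] = None
    return freqs


if __name__ == "__main__":
    # card0 / card1 are placeholders; check which one is the iGPU or dGPU.
    for card in ("card0", "card1"):
        print(card, read_gt_freqs(card))
```

Sampling this in a loop while the workload runs gives a simple record of whether each GPU stays at its high frequency.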

FreddieWitherden · 2023-02-21T12:02:10Z

So clpeak does appear to try and fully saturate the GPU when determining launch latency and uses multiple iterations:

https://github.com/krrishnarraj/clpeak/blob/master/src/kernel_latency.cpp#L11

and the results I get are reproducible.

As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).
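For reference, one common way to express launch latency with OpenCL event profiling is the gap between an event's `CL_PROFILING_COMMAND_QUEUED` and `CL_PROFILING_COMMAND_START` timestamps, averaged over many launches. Whether clpeak uses exactly these two timestamps is an assumption here; the linked source is authoritative. The sketch below only shows the arithmetic on already-collected timestamp pairs (in nanoseconds, as OpenCL reports them), and the function name is hypothetical.

```python
def mean_launch_latency_us(timestamps_ns):
    """Average queued-to-start gap in microseconds.

    timestamps_ns: iterable of (queued_ns, start_ns) pairs, one per launch,
    e.g. read from OpenCL event profiling info (assumed to be collected
    elsewhere; this function does not talk to OpenCL itself).
    """
    gaps = [start - queued for queued, start in timestamps_ns]
    if not gaps:
        raise ValueError("need at least one (queued, start) pair")
    return sum(gaps) / len(gaps) / 1000.0  # ns -> us


if __name__ == "__main__":
    # Made-up timestamps purely to show the call shape.
    samples = [(0, 40_000), (1_000, 46_000)]
    print(mean_launch_latency_us(samples))
```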

eero-t · 2023-02-27T13:12:04Z

Could you give utilization and frequency values for both GPUs whose latency you are comparing?

You can select which card is shown with the -d option, like this: intel_gpu_top -d drm:/dev/dri/card1.

FreddieWitherden · 2023-02-27T13:52:17Z

For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes or so) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is fixed at ~40%, give or take a percent.

eero-t · 2023-02-27T14:25:00Z

For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes or so) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is fixed at ~40%, give or take a percent.

As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).

Thanks!

So although clpeak is not able to utilize the GPU fully (which is a separate problem), at least both are running at full speed, i.e. things are comparable.

PS. On iGPUs, Render/3D is the pipeline with the shader cores, which can act in 3D, compute or media mode, whereas Arc has a separate compute engine pipeline. If you build the latest IGT version from upstream, it shows that [unknown] is the Compute engine: https://gitlab.freedesktop.org/drm/igt-gpu-tools

BartusW · 2023-04-03T12:30:45Z

Hello FreddieWitherden and BA8F0D39

GitHub issue #600 is still open even though answers were already given earlier in the thread. Let's summarize the topic and move towards closure. For the lowest latency on the Arc/Flex dGFX family, please consider using the Intel OEM kernel, which can be obtained by following the instructions at https://dgpu-docs.intel.com/installation-guides/index.html. The latest version is available as a package (https://dgpu-docs.intel.com/releases/stable_555_20230124.html) and as source code to build on your own (https://github.com/intel-gpu/intel-gpu-i915-backports). The OEM kernel driver contains a significant change to the dispatch model, which allows the Compute-Runtime drivers to enable low-latency submission. Direct submission is not available in the upstream generic kernel, as mentioned for 6.1/6.2.

With the upstream 6.1/6.2 generic Linux kernel, the compute-runtime driver uses the legacy submission model, which is the same unified path for integrated and discrete Gfx devices. With legacy submission, longer kernel launch latencies are expected.

FreddieWitherden · 2023-04-03T13:45:40Z

Okay sounds good. Are there any plans to upstream the direct submission paths to the mainline Linux kernel?

BartusW · 2023-04-03T14:22:27Z

At this moment there is no plan to upstream VM-Bind capability into 6.2 generic kernel.

BA8F0D39 · 2023-04-03T19:36:26Z

@BartusW
The last stable release was on 2023-01-24.
Will there be an update?

BartusW · 2023-04-04T09:00:43Z

Are you asking about VM-Bind (the answer is covered above) or just asking in general about kernel launch latency changes?

BA8F0D39 · 2023-04-06T02:01:17Z

@BartusW
I mean, will the kernel packages at
https://repositories.intel.com/graphics/
be updated?

nyanmisaka · 2023-04-07T10:57:42Z

At this moment there is no plan to upstream VM-Bind capability into 6.2 generic kernel.

Will the new dispatch model be added into the upstream i915 and Xe KMD in the future? Or is it an OEM exclusive feature?

eero-t · 2023-06-16T12:22:31Z

At this moment there is no plan to upstream VM-Bind capability into 6.2 generic kernel.

Will the new dispatch model be added into the upstream i915 and Xe KMD in the future? Or is it an OEM exclusive feature?

Xe kernel driver is based on VM_BIND. For more info, see:

Merge plan (April 2023): https://lore.kernel.org/dri-devel/20230418133133.80434-1-rodrigo.vivi@intel.com/
Initial submission (Dec 2022): https://patchwork.freedesktop.org/series/112188/

nyanmisaka · 2023-06-16T12:29:26Z

I've read somewhere that there will be no VM_BIND in i915, which will speed up the upstreaming of Xe KMD.

It has become clear that we have a long way towards fully featured implementation of VM_BIND in i915.
Examples of the many challenges include integration with display, integration with userspace drivers,
a rewrite of all the i915 IGTs to support execbuf3, alignment with DRM GPU VA manager[1] etc.

We are stopping further VM_BIND upstreaming efforts in i915 so we can accelerate the merge plan
for the new drm/xe driver[2] which has been designed for VM_BIND from the beginning.

https://lists.freedesktop.org/archives/intel-gfx/2023-April/324237.html

eero-t · 2023-06-16T13:29:42Z

(Disclaimer: I'm not a driver developer, just a spectator, so this is just my clueless observation.)

Making a major architectural change to the i915 kernel driver seems a practical impossibility to me, because it supports 1.5 decades of different Intel GPU HW, I think more than what's listed on this page: https://en.wikipedia.org/wiki/Intel_Graphics_Technology

That span covers multiple generations of user-space compute (OpenCL & SYCL), media, and 3D (OpenGL/GLES & Vulkan) drivers (maybe 10 different driver versions?), all of which would need to be tested and validated, some on very old / hard-to-get HW. After a change that large, there would likely still be quite a few bugs that users would find only later on. And I do not see users accepting the alternative either, i.e. the kernel driver dropping support for anything older than what the latest user-space driver generation supports (Broadwell and up).

BA8F0D39 mentioned this issue Feb 20, 2023

Memory Transfer on A770 16 GB Fails Unit Tests and has Incomplete Level Zero/OpenCL API #618

Closed
