Slow Kernel Launch Times #600
Comments
Integrated GPUs go through the KMD and GuC for submissions, hence the times you observe are within the expected range. |
Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission, so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency, this also avoids the need for clFlush calls to get submitted work items running. For an A770M I've observed launch times on the order of 15 us on the same system; better, but still over three times higher than what I'm used to seeing. |
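As a rough illustration of the "memcpy + atomic" idea mentioned above, here is a toy Python model of a userspace submission queue. It is a conceptual sketch only, not any vendor's actual interface: `ToySubmissionQueue`, its slot layout, and the use of a lock in place of a real atomic store are all assumptions made for illustration.

```python
import threading


class ToySubmissionQueue:
    """Conceptual model only: a command ring plus a 'doorbell' write index."""

    def __init__(self, slots=8, slot_size=64):
        self.slots = slots
        self.slot_size = slot_size
        # Stands in for memory that would be shared between the app and the GPU.
        self.ring = bytearray(slots * slot_size)
        self.write_index = 0  # stands in for the doorbell the consumer watches
        self.read_index = 0   # advanced by the consumer
        self._lock = threading.Lock()  # stands in for an atomic store

    def submit(self, packet: bytes) -> int:
        """Publish one command packet; returns the new write index."""
        if len(packet) > self.slot_size:
            raise ValueError("packet larger than a ring slot")
        if self.write_index - self.read_index >= self.slots:
            raise BufferError("ring is full")
        offset = (self.write_index % self.slots) * self.slot_size
        # Step 1, the "memcpy": copy the command packet into the ring.
        self.ring[offset:offset + len(packet)] = packet
        # Step 2, the "atomic": publish the new write index.
        with self._lock:
            self.write_index += 1
            return self.write_index

    def consume(self) -> bytes:
        """Play the role of the GPU: fetch the next published slot."""
        with self._lock:
            if self.read_index == self.write_index:
                raise LookupError("nothing submitted")
            offset = (self.read_index % self.slots) * self.slot_size
            self.read_index += 1
        return bytes(self.ring[offset:offset + self.slot_size])
```

The point of the model is that `submit` never calls into a kernel-mode driver: the producer only writes to shared memory and bumps an index, which is why this style of submission can be so much cheaper than a path that goes through the KMD.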
On the A770M, launch time should be around 8 us. "Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running." It is not a hardware limitation; integrated parts are also capable of direct submission. The code is not ready to be enabled, though: it requires VM_BIND to be enabled in the Linux kernel, which is not the case for integrated parts. |
Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, but I currently cannot get OpenCL to work on my A770 (although it works fine as a display adapter). |
The best Intel dGPU support is currently in the backport kernels: https://github.com/intel-gpu/intel-gpu-i915-backports Binary packages for those are available from the Intel repository: https://dgpu-docs.intel.com/installation-guides/index.html The Intel dGPU support enabled in the v6.2(-rc1) upstream Linux kernel still lags somewhat behind those. Older upstream kernel versions are missing even more features, and also need the force-probe option even to recognize Intel dGPUs. |
Having upgraded to 6.2 I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43us and the i7-12700H is still there at ~45us. These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases where the integrated GPU is outperforming my A770M despite it having a huge advantage in execution resources and bandwidth. |
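To put the figures quoted so far side by side, here is a small arithmetic sketch. The numbers are the ones reported in this thread (99.43 us and ~45 us measured on kernel 6.2, and the ~8 us figure given as the expected A770M launch time); the variable and function names are illustrative only.

```python
# Launch latencies reported in this thread, in microseconds.
measured_us = {
    "A770M (kernel 6.2)": 99.43,
    "i7-12700H iGPU": 45.0,  # reported as "~45us"
}
expected_a770m_us = 8.0  # reported as "around 8us"


def slowdown(measured, reference):
    """Return how many times larger `measured` is than `reference`."""
    return measured / reference


if __name__ == "__main__":
    for name, us in measured_us.items():
        ratio = slowdown(us, expected_a770m_us)
        print(f"{name}: {us:.2f} us, {ratio:.1f}x the ~8 us figure")
    dgpu_vs_igpu = slowdown(measured_us["A770M (kernel 6.2)"],
                            measured_us["i7-12700H iGPU"])
    print(f"A770M vs iGPU: {dgpu_vs_igpu:.1f}x")
```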
Is this startup timing for cold start (single run), or average of warm ones (tight loop of runs)? If you run some other (lightweight) workload for the same GPU in the background (so that GPU frequency is up when your test starts), will dGPU perform better than iGPU?
Please check with If this is the case, the issue is rather kernel / FW power management, than compute driver / GPU job submission. |
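The cold-versus-warm distinction raised above can be sketched as a small timing harness. This is a minimal sketch, not how clpeak measures latency: `measure_launch_latency` and `fake_launch` are hypothetical names, and `fake_launch` is only a placeholder for "enqueue an empty kernel and wait for it to finish".

```python
import time
from statistics import mean


def measure_launch_latency(launch, warm_iterations=1000):
    """Return (cold_us, warm_avg_us) for a blocking `launch` callable."""
    # Cold start: the first call includes any one-off setup or clock ramp-up.
    start = time.perf_counter()
    launch()
    cold_us = (time.perf_counter() - start) * 1e6

    # Warm runs: a tight loop, each call timed individually, then averaged.
    samples = []
    for _ in range(warm_iterations):
        start = time.perf_counter()
        launch()
        samples.append((time.perf_counter() - start) * 1e6)
    return cold_us, mean(samples)


def fake_launch():
    # Hypothetical stand-in for "enqueue an empty kernel and wait for it".
    pass


if __name__ == "__main__":
    cold, warm = measure_launch_latency(fake_launch, warm_iterations=100)
    print(f"cold: {cold:.2f} us, warm average: {warm:.2f} us")
```

Reporting the two values separately makes it easier to tell a power-management effect (a large cold value with a small warm average) from a submission-path cost that shows up on every launch.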
So clpeak does appear to try to fully saturate the GPU when determining launch latency, and it uses multiple iterations: https://github.com/krrishnarraj/clpeak/blob/master/src/kernel_latency.cpp#L11 The results I get are reproducible. As for my real-world test case, the clocks are pinned at ~2050 MHz according to |
Could you give utilization and frequency values for both GPUs whose latency you are comparing? You can select which card is shown with the |
For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes or so) at ~1400 MHz, give or take 5 MHz. The utilisation here is from |
Thanks! So although PS. On iGPUs, |
Hello FreddieWitherden and BA8F0D39. GitHub issue #600 is still open although the answers have already been given above. Let's summarize this topic and move to closure. For the lowest latency on the Arc/Flex dGFX family, please consider using the Intel OEM kernel, which can be obtained by following the instructions here: https://dgpu-docs.intel.com/installation-guides/index.html The latest version is available as a package (https://dgpu-docs.intel.com/releases/stable_555_20230124.html) and as source code to build on your own (https://github.com/intel-gpu/intel-gpu-i915-backports). The OEM kernel driver contains a significant change to the dispatch model, which allows the Compute-Runtime drivers to enable low-latency submission. Direct submission is not available in the upstream generic kernel, as mentioned for 6.1/6.2. With the upstream 6.1/6.2 generic Linux kernel, the compute-runtime driver uses the legacy submission model, which is the same unified path for integrated and discrete Gfx devices. With legacy submission, longer kernel launch latencies are expected. |
Okay sounds good. Are there any plans to upstream the direct submission paths to the mainline Linux kernel? |
At this moment there is no plan to upstream VM-Bind capability into 6.2 generic kernel. |
@BartusW |
Are you asking about VM-Bind (the answer is covered above) or just asking in general about kernel launch latency changes? |
@BartusW |
Will the new dispatch model be added into the upstream i915 and Xe KMD in the future? Or is it an OEM exclusive feature? |
Xe kernel driver is based on VM_BIND. For more info, see:
|
I've read somewhere that there will be no VM_BIND in i915, which will speed up the upstreaming of Xe KMD.
https://lists.freedesktop.org/archives/intel-gfx/2023-April/324237.html |
(Disclaimer: I'm not a driver developer, just a spectator, so this is just my clueless observation.) Making a major architectural change to the i915 kernel driver seems a practical impossibility to me, because it supports 1.5 decades of different Intel GPU HW, I think more than what's listed on this page: https://en.wikipedia.org/wiki/Intel_Graphics_Technology That span covers multiple generations of user-space compute (OpenCL & SYCL), media, and 3D (OpenGL/GLES & Vulkan) drivers (maybe 10 different driver versions?), all of which would need to be tested and validated, some on very old / hard-to-get HW. After a change that large, there would likely still be quite a few bugs that users would find only later on. And I do not see the alternative, the kernel driver dropping support for anything older than what the last user-space driver generation supports (i.e. Broadwell and up), being accepted by users either. |
Using clpeak with runtime 22.43.24595 on the integrated GPU on an i7-12700H CPU under Linux I find the kernel launch latency to be 42.46 us. This is around 10 times slower than can be expected for a recent discrete GPU from AMD/NVIDIA connected over PCIe. In general, one would expect integrated GPUs to have an advantage here since the CPU and GPU both share an LLC.