[SYCL] DPC++ reduction library incorrect event profiling timing #2820
Comments
error: use of undeclared identifier 'n_blk'
Oops, the […]
Following your report, here is what I observe: profiling shows three kernels for the floating-point reduction and one kernel for the integer reduction.
@huanghua1994 @zjin-lcf @bader Hi, thanks for posting this. I find that the timing results depend a lot on the device used. In any case, if you try your benchmark with the latest DPC++ commit, you may see that the T = float case behaves differently than it did before, whereas the T = int case is probably unchanged. This is because the float reduction now also uses atomic operations for adding to the final reduction variable rather than an auxiliary kernel.
However, as zjin-lcf mentions, the reduction timing that you record still does not correspond to the main reduction kernel. Instead it corresponds to a call to reduSaveFinalResultToUserMem that is made for the USM case but not when using buffers. If you switch to buffers for your 'data' and 'sum' variables, you should see that the reduction event timing matches the main reduction kernel time reported by nvprof. The general problem with event timings on the CUDA backend appears to be that only the timing of the last kernel enqueued during the event, 'ev1', is recorded, i.e. reduSaveFinalResultToUserMem, rather than the kernel that takes the most time, i.e. sycl_reduction_main_kernel. We are currently investigating this issue.
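For illustration, a buffer-based variant along the lines suggested above might look like the sketch below. This is not the original benchmark: it uses the current SYCL 2020 reduction interface (the linked test may use an older namespace), and the size and names are illustrative.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
  constexpr size_t N = 1 << 20;
  // Profiling must be enabled on the queue for command_start/command_end.
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};

  std::vector<float> host_data(N, 1.0f);
  float sum = 0.0f;
  {
    // Buffers instead of USM pointers for both the input and the result.
    sycl::buffer<float> data_buf(host_data.data(), sycl::range<1>(N));
    sycl::buffer<float> sum_buf(&sum, sycl::range<1>(1));

    sycl::event ev1 = q.submit([&](sycl::handler &cgh) {
      sycl::accessor data{data_buf, cgh, sycl::read_only};
      auto red = sycl::reduction(sum_buf, cgh, sycl::plus<float>());
      cgh.parallel_for(sycl::range<1>(N), red,
                       [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
    });
    ev1.wait();

    auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::printf("reduction event time: %.3f ms\n", (t1 - t0) * 1e-6);
  } // sum_buf destructor copies the result back into 'sum'
  std::printf("sum = %f\n", sum);
}
```

With buffers, the comment above suggests the event timing should line up with the main reduction kernel seen in nvprof, whereas the USM version times reduSaveFinalResultToUserMem instead.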
Thank you for your feedback. I look forward to your updates.
@v-klochkov FYI
Update: I believe that the resolution for this issue will be discussed internally.
@JackAKirk, thanks for the analysis. @v-klochkov, please take a look at this issue.
The timing/profiling using events turned out to be an unpleasant surprise; it was not taken into account during development of the reduction. Currently the reduction uses more than one kernel per parallel_for invocation to get better performance.
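One way to see the consequence of this is to compare the profiled window of the returned event with the wall-clock time of the whole submission, which covers every kernel enqueued for the single parallel_for. The helpers below are a generic sketch with assumed names, not part of the DPC++ implementation:

```cpp
#include <sycl/sycl.hpp>
#include <chrono>

// Duration covered by the event's profiling window, in milliseconds.
double event_ms(const sycl::event &e) {
  auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  return (t1 - t0) * 1e-6; // profiling info is reported in nanoseconds
}

// Host-side wall-clock time between two points, in milliseconds.
double wall_ms(std::chrono::steady_clock::time_point a,
               std::chrono::steady_clock::time_point b) {
  return std::chrono::duration<double, std::milli>(b - a).count();
}

// Usage (sketch): record steady_clock::now() before submit() and after
// ev1.wait(); if event_ms(ev1) is much smaller than wall_ms(...), the event
// is not covering all of the kernels launched for the reduction.
```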
I don't think this is as big a problem as you expect -- I would argue that the necessary heavy lifting (if there is any) will be in the […]. Using multiple […]
Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s). @v-klochkov, could I ask you to take one of the following actions? :)
Thanks!
@steffenlarsen, @aelovikov-intel - please re-assign/dispatch. The reduction implementation has received a lot of extra changes since I initially implemented it a few years ago and left the Scalar SYCL squad.
Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s). @steffenlarsen, could you please take one of the following actions:
Thanks!
Test file: https://github.com/huanghua1994/HPC_Playground/blob/master/SYCL/reduction_timing.cpp
Compiler version: git commit 140c0d0
Compiler configuration:
buildbot/configure.py --cuda
Selected device: GTX 1070, CUDA version 11.0, driver version 455.38
Problem description:
When using the DPC++ reduction library for a float type add reduction, info::event_profiling::command_{start/end} returned incorrect timings (too small). For an int type add reduction, the timings are correct.
Sample output when using T = float:
Sample output when using T = int:
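For reference, a rough USM-based sketch of the kind of test described above (the actual code is in the linked reduction_timing.cpp; the sizes, names, and initialization here are illustrative):

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

template <typename T>
void run_reduction(sycl::queue &q, size_t n) {
  // USM allocations, as in the reported (failing) configuration.
  T *data = sycl::malloc_device<T>(n, q);
  T *sum = sycl::malloc_shared<T>(1, q);
  q.fill(data, T(1), n).wait();
  *sum = T(0);

  sycl::event ev1 = q.submit([&](sycl::handler &cgh) {
    auto red = sycl::reduction(sum, sycl::plus<T>());
    cgh.parallel_for(sycl::range<1>(n), red,
                     [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
  });
  ev1.wait();

  auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::printf("event time: %.3f ms, sum = %f\n", (t1 - t0) * 1e-6, double(*sum));

  sycl::free(data, q);
  sycl::free(sum, q);
}

int main() {
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};
  run_reduction<float>(q, 1 << 20); // reported as returning too-small timings
  run_reduction<int>(q, 1 << 20);   // reported as returning correct timings
}
```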