
[SYCL] DPC++ reduction library incorrect event profiling timing #2820

Open
huanghua1994 opened this issue Nov 25, 2020 · 14 comments

@huanghua1994

Test file: https://github.com/huanghua1994/HPC_Playground/blob/master/SYCL/reduction_timing.cpp
Compiler version: git commit 140c0d0
Compiler configuration: buildbot/configure.py --cuda
Selected device: GTX 1070, CUDA version 11.0, driver version 455.38

Problem description:
When using the DPC++ reduction library for a float-type add reduction, info::event_profiling::command_{start/end} returns incorrect timings (too small). For an int-type add reduction, the timings are correct.

Sample output when using T = float:

$ ./reduction_timing.exe 1048576 128
n = 1048576, b = 128
Runtime with reduction    = 64513 ns
Runtime without reduction = 175105 ns

Sample output when using T = int:

$ ./reduction_timing.exe 1048576 128
n = 1048576, b = 128
Runtime with reduction    = 2096161 ns
Runtime without reduction = 175102 ns
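
For reference, the following is a minimal sketch (not the linked test file itself) of how such a measurement is typically set up: a USM add reduction submitted to a queue created with profiling enabled, followed by queries of info::event_profiling::command_start/command_end on the returned event. It uses the current SYCL 2020 spellings (sycl::reduction, sycl::plus) rather than the ONEAPI:: names from 2020; identifiers such as data and sum are illustrative.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    using T = float;                     // switch to int to compare the two cases
    const size_t n = 1048576, b = 128;

    // Profiling must be enabled on the queue for the event queries below to work.
    sycl::queue q{sycl::gpu_selector_v,
                  sycl::property::queue::enable_profiling{}};

    T *data = sycl::malloc_shared<T>(n, q);
    T *sum  = sycl::malloc_shared<T>(1, q);
    for (size_t i = 0; i < n; ++i) data[i] = T(1);
    *sum = T(0);

    // A single submit; the DPC++ reduction may launch several kernels under the hood.
    sycl::event ev = q.submit([&](sycl::handler &h) {
        auto red = sycl::reduction(sum, sycl::plus<T>());
        h.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}}, red,
                       [=](sycl::nd_item<1> it, auto &acc) {
                           acc += data[it.get_global_id(0)];
                       });
    });
    ev.wait();

    auto start = ev.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end   = ev.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "Runtime with reduction = " << (end - start) << " ns" << std::endl;

    sycl::free(data, q);
    sycl::free(sum, q);
    return 0;
}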
@huanghua1994 huanghua1994 changed the title [SYCL] Incorrect event profiling timing [SYCL] DPC++ reduction library incorrect event profiling timing Nov 25, 2020
@zjin-lcf
Contributor

error: use of undeclared identifier 'n_blk'
std::cout << "n = " << n << ", b = " << b << ", n_blk = " << n_blk << std::endl;

@huanghua1994
Author

huanghua1994 commented Nov 27, 2020

Oops, the n_blk bug is fixed now

@zjin-lcf
Contributor

zjin-lcf commented Nov 27, 2020

Following your report, here is what I observe:

  • Profiling shows there are three kernels for the floating-point reduction and one kernel for the integer reduction.
  • The profiling APIs measure only one kernel on an Intel iGPU using the OpenCL backend.
  • On an NVIDIA GPU, it is difficult to make sense of the event profiling result when compared with the nvprof results:

Runtime with    reduction = 14303 ns
==57761== Profiling application: ./test 1048576 128
==57761== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   91.76%  130.02us         1  130.02us  130.02us  130.02us  _ZTSN2cl4sycl6ONEAPI6detail28__sycl_reduction_main_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1ENS0_8accessorIfLi0ELNS0_6access4modeE1026ELNSD_6targetE2014ELNSD_11placeholderE0ENS1_22accessor_property_listIJEEEEEEE
                     5.24%  7.4240us         1  7.4240us  7.4240us  7.4240us  _ZTSN2cl4sycl6ONEAPI6detail27__sycl_reduction_aux_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1ENS0_8accessorIfLi0ELNS0_6access4modeE1026ELNSD_6targetE2014ELNSD_11placeholderE0ENS1_22accessor_property_listIJEEEEEEE
                     3.00%  4.2560us         1  4.2560us  4.2560us  4.2560us  _ZTSN2cl4sycl6ONEAPI6detail27__sycl_reduction_aux_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1EPfEE
      API calls:   89.91%  215.31ms         1  215.31ms  215.31ms  215.31ms  cuCtxCreate
                    8.99%  21.520ms         2  10.760ms  371.10us  21.149ms  cuMemAllocManaged
                    0.39%  941.27us         3  313.76us  13.069us  912.82us  cuLaunchKernel
                    0.27%  657.65us         1  657.65us  657.65us  657.65us  cuModuleLoadDataEx
                    0.27%  647.12us         2  323.56us  89.447us  557.67us  cuMemFree
                    0.08%  196.46us         2  98.227us  7.6220us  188.83us  cuMemAlloc
                    0.03%  69.243us         2  34.621us  22.236us  47.007us  cuDeviceGetName
                    0.02%  50.095us        10  5.0090us  2.6110us  9.6370us  cuEventRecord
                    0.01%  16.144us         3  5.3810us  1.5580us  12.565us  cuEventSynchronize
                    0.01%  13.085us         1  13.085us  13.085us  13.085us  cuStreamCreate
                    0.00%  11.241us        10  1.1240us     367ns  4.6120us  cuEventCreate
                    0.00%  10.388us         2  5.1940us  3.1410us  7.2470us  cuDeviceGetPCIBusId
                    0.00%  7.8110us        21     371ns     163ns     653ns  cuCtxGetCurrent
                    0.00%  7.7100us         6  1.2850us     422ns  3.8320us  cuModuleGetFunction
                    0.00%  6.9590us        19     366ns     174ns  1.6670us  cuDeviceGetAttribute
                    0.00%  5.9870us         2  2.9930us  1.6430us  4.3440us  cuStreamWaitEvent
                    0.00%  5.3330us         2  2.6660us  2.2510us  3.0820us  cuEventElapsedTime
                    0.00%  3.8590us         2  1.9290us     606ns  3.2530us  cuPointerGetAttribute
                    0.00%  1.7440us         4     436ns     156ns     785ns  cuDeviceGet
                    0.00%  1.5870us         3     529ns     283ns     877ns  cuDeviceGetCount

@AlexeySachkov AlexeySachkov added the cuda CUDA back-end label Feb 2, 2021
@JackAKirk
Contributor

@huanghua1994 @zjin-lcf @bader

Hi, thanks for posting this. I find that the timing results depend a lot on the device used. In any case, if you try your benchmark with the latest DPC++ commit you may see that the T = float case behaves differently than it did before, whereas the T = int case is probably unchanged. This is because the float reduction now also uses atomic operations for adding to the final reduction variable rather than an auxiliary kernel.
However, as zjin-lcf mentions, the reduction timing you record still does not correspond to the main reduction kernel. Instead it corresponds to a call to reduSaveFinalResultToUserMem, which is made for the USM case but not when buffers are used. If you switch to buffers for your 'data' and 'sum' variables, you should see that the reduction event timing gives the same value as the main reduction kernel from nvprof. The general problem with event timings on the CUDA backend appears to be that only the timing of the last kernel enqueued during the event, 'ev1', is recorded, i.e. reduSaveFinalResultToUserMem, rather than the kernel which takes the most time, i.e. sycl_reduction_main_kernel. We are currently investigating this issue.
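
To illustrate the buffer suggestion, here is a hedged sketch of that variant (continuing the USM example from the issue description; names are illustrative, this is not the exact code from the linked test). With buffers there is no reduSaveFinalResultToUserMem call, so the event returned by submit should correspond to the main reduction kernel:

#include <vector>

std::vector<T> host_data(n, T(1));
sycl::buffer<T, 1> data_buf{host_data.data(), sycl::range<1>{n}};
sycl::buffer<T, 1> sum_buf{sycl::range<1>{1}};

sycl::event ev = q.submit([&](sycl::handler &h) {
    sycl::accessor data_acc{data_buf, h, sycl::read_only};
    // initialize_to_identity so the uninitialized buffer content is not combined in.
    auto red = sycl::reduction(sum_buf, h, sycl::plus<T>(),
                               sycl::property_list{sycl::property::reduction::initialize_to_identity{}});
    h.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}}, red,
                   [=](sycl::nd_item<1> it, auto &acc) {
                       acc += data_acc[it.get_global_id(0)];
                   });
});
ev.wait();
// Profiling queries on ev should now match the __sycl_reduction_main_kernel time from nvprof.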

@zjin-lcf
Contributor

zjin-lcf commented Aug 5, 2021 via email

@romanovvlad
Contributor

@v-klochkov FYI

@JackAKirk
Contributor

JackAKirk commented Aug 9, 2021

Update:
I think this is a reduction-specific issue and I expect it affects all backends, not just CUDA. In short, submitting the command group for the reduction parallel_for creates more than one event in the USM case, and the event 'ev1' returned by the call to submit is the last event created during that call, which corresponds to the call to reduSaveFinalResultToUserMem.

I believe that the resolution for this issue will be discussed internally.
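
One interim way to time the whole reduction from the host, while the returned event only covers the last kernel, is to wrap a blocking wait() in wall-clock timing. This is just a sketch (not taken from the original test; it assumes the USM names q, n, b, data and sum from the issue description) and captures every kernel launched by the single submit:

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
q.submit([&](sycl::handler &h) {
    auto red = sycl::reduction(sum, sycl::plus<T>());
    h.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}}, red,
                   [=](sycl::nd_item<1> it, auto &acc) {
                       acc += data[it.get_global_id(0)];
                   });
}).wait();
auto t1 = std::chrono::steady_clock::now();
std::cout << "Wall-clock reduction time = "
          << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
          << " ns" << std::endl;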

@bader
Contributor

bader commented Aug 18, 2021

> Update:
> I think this is a reduction-specific issue and I expect it affects all backends, not just CUDA. In short, submitting the command group for the reduction parallel_for creates more than one event in the USM case, and the event 'ev1' returned by the call to submit is the last event created during that call, which corresponds to the call to reduSaveFinalResultToUserMem.
>
> I believe that the resolution for this issue will be discussed internally.

@JackAKirk, thanks for the analysis. @v-klochkov, please, take a look at this issue.

@bader bader removed the cuda CUDA back-end label Aug 18, 2021
@v-klochkov
Contributor

The timing/profiling using events turned out to be a bad surprise; it was not taken into account during development of the reduction. Currently the reduction uses more than one kernel per parallel_for invocation to get better performance.
Fixing this issue may require significant rework of the reduction implementation.

@Pennycook
Contributor

> The timing/profiling using events turned out to be a bad surprise; it was not taken into account during development of the reduction. Currently the reduction uses more than one kernel per parallel_for invocation to get better performance.
> Fixing this issue may require significant rework of the reduction implementation.

I don't think this is as big a problem as you expect -- I would argue that the necessary heavy lifting (if there is any) will be in the event class rather than the reduction class.

Using multiple parallel_for invocations as an implementation detail behind our reductions is fine, as long as we can provide a sycl::event object that represents all of those invocations. Waiting on the event returned from a reduction should wait on all kernels submitted to the device, and querying the profiling information for that event should represent submission of the whole command-group. The SYCL 2020 specification explicitly says that a sycl::event may map to several OpenCL backend events -- there's no guarantee that a single command group maps to a single kernel!
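
As a conceptual sketch (not the DPC++ implementation), an event wrapping several native backend events could report command_start as the earliest start and command_end as the latest end across all the kernels it covers:

#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-kernel timestamps gathered from the backend events.
struct NativeEventTimes {
    uint64_t start_ns;
    uint64_t end_ns;
};

struct AggregateProfile {
    uint64_t command_start;
    uint64_t command_end;
};

// Aggregate the profiling info of all kernels behind one sycl::event.
AggregateProfile aggregate(const std::vector<NativeEventTimes> &kernels) {
    AggregateProfile p{std::numeric_limits<uint64_t>::max(), 0};
    for (const auto &k : kernels) {
        p.command_start = std::min(p.command_start, k.start_ns);
        p.command_end   = std::max(p.command_end, k.end_ns);
    }
    return p;
}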

@AerialMantis AerialMantis added cuda CUDA back-end runtime Runtime library related issue and removed cuda CUDA back-end runtime Runtime library related issue labels Aug 23, 2021
@KornevNikita
Contributor

KornevNikita commented May 17, 2024

Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).

@v-klochkov, could I ask you to take one of the following actions? :)

  • Please provide an update if you have any (or just a small comment if you don't have any yet).
  • OR mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it.
  • OR close the issue if it has been resolved.
  • OR take any other suitable action.

Thanks!

@v-klochkov
Contributor

@steffenlarsen, @aelovikov-intel - please re-assign/dispatch. The reduction implementation has received many changes since I initially implemented it a few years ago and left the Scalar SYCL squad.

Contributor

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@steffenlarsen, could you please take one of the following actions:

  • provide an update if you have any
  • unassign yourself if you're not looking / going to look into this issue
  • mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it
  • close the issue if it has been resolved
  • take any other suitable action.

Thanks!

1 similar comment
