[SYCL] DPC++ reduction library incorrect event profiling timing #2820
Comments
error: use of undeclared identifier 'n_blk'
Oops, the […]
Following your report, here is what I observe: profiling shows three kernels for the floating-point reduction and one kernel for the integer reduction.
@huanghua1994 @zjin-lcf @bader Hi, thanks for posting this. I find that the timing results depend a lot on the device used. In any case, if you try your benchmark with the latest DPC++ commit, you may see that the T = float case behaves differently than it did before, whereas the T = int case is probably unchanged. This is because the float reduction now also uses atomic operations for adding to the final reduction variable rather than an auxiliary kernel.
However, as zjin-lcf mentions, the reduction timing that you record still does not correspond to the main reduction kernel. Instead it corresponds to a call to reduSaveFinalResultToUserMem that is made for the USM case but not when using buffers. If you switch to buffers for your 'data' and 'sum' variables, you should see that the reduction event timing matches the main reduction kernel time reported by nvprof. The general problem with event timings on the CUDA backend appears to be that only the timing of the last kernel enqueued during the event, 'ev1', is recorded, i.e. reduSaveFinalResultToUserMem, rather than the kernel that takes the most time, i.e. sycl_reduction_main_kernel. We are currently investigating this issue.
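For illustration, a buffer-based variant along the lines suggested above might look like the sketch below. This is not the original benchmark: it uses the current SYCL 2020 reduction interface (the linked test may use an older namespace), and the size and names are illustrative.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
  constexpr size_t N = 1 << 20;
  // Profiling must be enabled on the queue for command_start/command_end.
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};

  std::vector<float> host_data(N, 1.0f);
  float sum = 0.0f;
  {
    // Buffers instead of USM pointers for both the input and the result.
    sycl::buffer<float> data_buf(host_data.data(), sycl::range<1>(N));
    sycl::buffer<float> sum_buf(&sum, sycl::range<1>(1));

    sycl::event ev1 = q.submit([&](sycl::handler &cgh) {
      sycl::accessor data{data_buf, cgh, sycl::read_only};
      auto red = sycl::reduction(sum_buf, cgh, sycl::plus<float>());
      cgh.parallel_for(sycl::range<1>(N), red,
                       [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
    });
    ev1.wait();

    auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::printf("reduction event time: %.3f ms\n", (t1 - t0) * 1e-6);
  } // sum_buf destructor copies the result back into 'sum'
  std::printf("sum = %f\n", sum);
}
```

With buffers, the comment above suggests the event timing should line up with the main reduction kernel seen in nvprof, whereas the USM version times reduSaveFinalResultToUserMem instead.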
Thank you for your feedback. I look forward to your updates.
@v-klochkov FYI
Update: I believe that the resolution for this issue will be discussed internally.
@JackAKirk, thanks for the analysis. @v-klochkov, please take a look at this issue.
The timing/profiling using events turned out to be an unpleasant surprise; it was not taken into account during development of the reduction. Currently the reduction uses more than one kernel per parallel_for invocation to get better performance.
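One way to see the consequence of this is to compare the profiled window of the returned event with the wall-clock time of the whole submission, which covers every kernel enqueued for the single parallel_for. The helpers below are a generic sketch with assumed names, not part of the DPC++ implementation:

```cpp
#include <sycl/sycl.hpp>
#include <chrono>

// Duration covered by the event's profiling window, in milliseconds.
double event_ms(const sycl::event &e) {
  auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  return (t1 - t0) * 1e-6; // profiling info is reported in nanoseconds
}

// Host-side wall-clock time between two points, in milliseconds.
double wall_ms(std::chrono::steady_clock::time_point a,
               std::chrono::steady_clock::time_point b) {
  return std::chrono::duration<double, std::milli>(b - a).count();
}

// Usage (sketch): record steady_clock::now() before submit() and after
// ev1.wait(); if event_ms(ev1) is much smaller than wall_ms(...), the event
// is not covering all of the kernels launched for the reduction.
```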
I don't think this is as big a problem as you expect -- I would argue that the necessary heavy lifting (if there is any) will be in the […]. Using multiple […]
Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s). @v-klochkov, could I ask you to take one of the following actions? :)
Thanks!
@steffenlarsen, @aelovikov-intel - please re-assign/dispatch. The reduction implementation has received a lot of extra changes since I initially implemented it a few years ago and left the Scalar SYCL squad.
Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s). @steffenlarsen, could you please take one of the following actions:
Thanks!
Test file: https://github.com/huanghua1994/HPC_Playground/blob/master/SYCL/reduction_timing.cpp
Compiler version: git commit 140c0d0
Compiler configuration:
buildbot/configure.py --cuda
Selected device: GTX 1070, CUDA version 11.0, driver version 455.38
Problem description:
When using the DPC++ reduction library for a float type add reduction, info::event_profiling::command_{start/end} returned incorrect timings (too small). For an int type add reduction, the timings are correct.
Sample output when using T = float:
Sample output when using T = int:
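For reference, a rough USM-based sketch of the kind of test described above (the actual code is in the linked reduction_timing.cpp; the sizes, names, and initialization here are illustrative):

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

template <typename T>
void run_reduction(sycl::queue &q, size_t n) {
  // USM allocations, as in the reported (failing) configuration.
  T *data = sycl::malloc_device<T>(n, q);
  T *sum = sycl::malloc_shared<T>(1, q);
  q.fill(data, T(1), n).wait();
  *sum = T(0);

  sycl::event ev1 = q.submit([&](sycl::handler &cgh) {
    auto red = sycl::reduction(sum, sycl::plus<T>());
    cgh.parallel_for(sycl::range<1>(n), red,
                     [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
  });
  ev1.wait();

  auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::printf("event time: %.3f ms, sum = %f\n", (t1 - t0) * 1e-6, double(*sum));

  sycl::free(data, q);
  sycl::free(sum, q);
}

int main() {
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};
  run_reduction<float>(q, 1 << 20); // reported as returning too-small timings
  run_reduction<int>(q, 1 << 20);   // reported as returning correct timings
}
```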