TeamScan for CUDA, Pthreads, OpenMPTarget, HIP #3536
jrmadsen
commented
Oct 28, 2020
- implemented team-level parallel scan in CUDA
impl sketch parallel_scan(..., N, ...) {
my_sum;
for(int i = N/team_size*team_rank; ... ) {
f(i,my_sum,false);
}
offset = team.team_scan(my_sum);
my_sum = 0;
for(int i = N/team_size*team_rank; ... ) {
f(i,my_sum+offset,true);
}
} Cuda:
@crtrott I ended up with a slightly different implementation than you recommended. This appears to work as long as the team-size is a power of 2, otherwise the |
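For readers following the sketch above, here is a minimal host-side emulation of the two-pass idea in plain C++. It is not the code from this PR and does not use Kokkos: the function name `emulated_team_exclusive_scan` is hypothetical, the loop over `rank` stands in for threads that would run concurrently, and the serial prefix sum over `my_sum` stands in for `team.team_scan(my_sum)`. The ceil-division chunking is also an assumption, chosen so that every element is covered when `N` is not a multiple of the team size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of a two-pass team scan.
// Returns the exclusive prefix sum of `in`.
std::vector<long> emulated_team_exclusive_scan(const std::vector<long>& in,
                                               std::size_t team_size) {
  assert(team_size > 0);
  const std::size_t n = in.size();
  // Assumption: ceil-divide so the chunks cover all n elements.
  const std::size_t chunk = (n + team_size - 1) / team_size;

  // Pass 1: each rank reduces its own contiguous chunk.
  std::vector<long> my_sum(team_size, 0);
  for (std::size_t rank = 0; rank < team_size; ++rank) {
    const std::size_t begin = std::min(rank * chunk, n);
    const std::size_t end   = std::min(begin + chunk, n);
    for (std::size_t i = begin; i < end; ++i) my_sum[rank] += in[i];
  }

  // Stand-in for team.team_scan(my_sum): exclusive scan of per-rank totals.
  std::vector<long> offset(team_size, 0);
  for (std::size_t rank = 1; rank < team_size; ++rank)
    offset[rank] = offset[rank - 1] + my_sum[rank - 1];

  // Pass 2: each rank rescans its chunk, starting from its offset,
  // and writes the final values.
  std::vector<long> out(n);
  for (std::size_t rank = 0; rank < team_size; ++rank) {
    const std::size_t begin = std::min(rank * chunk, n);
    const std::size_t end   = std::min(begin + chunk, n);
    long running = offset[rank];
    for (std::size_t i = begin; i < end; ++i) {
      out[i] = running;
      running += in[i];
    }
  }
  return out;
}
```

The emulation works for any positive `team_size` because the middle step is a plain serial loop; a power-of-2 restriction would come from how the team-level scan itself is implemented on the device, which this sketch does not model.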
091ee96
to
0552e05
Retest this please
We met today and discussed correctness issues. Jonathan is working on this.
- Added TestTeamScan.hpp
- Renamed team_scan test to team_reduction_scan in TeamReductionScan due to naming conflict
@dalg24 @masterleinad Do either of y'all know why the Jenkins build failed? I searched for all instances of the words "error" and "fail" in the Jenkins log, and basically the only "failure" was "script exited with error code 2", but the build and tests appear to be fine.
Minus the "exclude the test in the exclude list for SYCL" this looks good!
The HIP backend looks good
Please clarify the tolerance with floating point numbers
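As background for the tolerance question: floating-point addition is not associative, so a parallel scan that combines partial sums in a different order than a serial reference can differ from it in the last few bits. Below is a minimal sketch of a relative-tolerance comparison. The helper name `approx_equal` and the default tolerances are hypothetical, chosen for illustration; they are not the values used by the test in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when `computed` matches `expected` within a
// relative tolerance, with an absolute floor for values near zero.
inline bool approx_equal(double computed, double expected,
                         double rel_tol = 1e-12, double abs_tol = 1e-12) {
  assert(rel_tol >= 0.0 && abs_tol >= 0.0);
  const double scale = std::max(std::fabs(computed), std::fabs(expected));
  return std::fabs(computed - expected) <= std::max(abs_tol, rel_tol * scale);
}
```

A scan test could compare each output entry against a serially computed reference with such a helper instead of using `==`. An appropriate tolerance depends on the value type and on the number of terms being summed.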
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Retest this please.
So, interesting question: should we just merge? Windows is likely failing because of the MSVC stuff, for Jenkins it's the AMD node, and Travis is one timeout ...
Also: Jonathan, do you want to rewrite history, or should I squash-commit?
Hi @jrmadsen @crtrott , I have code using something like kokkos/core/src/Cuda/Kokkos_Cuda_ReduceScan.hpp Lines 698 to 700 in e483144
(cuda-gdb) p blockDim.y
$4 = 224
(cuda-gdb) p blockDim.y - 1
$5 = 223
any idea what went wrong here? Does the code in this PR require the item counts |
Judging from the unit test cases, I guess this implementation only works if the work-item count passed to TeamThreadRange is a power of 2?
Yes
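Given that restriction, a caller can check a count before launching. The following is a minimal sketch, not part of Kokkos: the helper name `is_power_of_two` is hypothetical. It uses the standard bit trick that a power of two has exactly one bit set, so `n & (n - 1)` clears that bit and yields zero.

```cpp
// Hypothetical helper: true only when n is a power of two.
constexpr bool is_power_of_two(unsigned n) {
  return n != 0 && (n & (n - 1)) == 0;
}

// The blockDim.y value printed by cuda-gdb above (224 = 128 + 64 + 32)
// is not a power of two, while 256 is.
static_assert(!is_power_of_two(224), "224 is not a power of two");
static_assert(is_power_of_two(256), "256 is a power of two");
```

Because the function is `constexpr`, the check can run at compile time when the size is a constant, or at run time otherwise.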
Would it be difficult to support non-power-of-2 work counts? I can help implement it, at least for the CUDA backend. I have a lot of small loops with a few hundred work items each, and being forced to use power-of-2 loops would be a big performance hit, especially on the host, because I would need to convert all the host-side loops to powers of 2 to stay portable.
We will discuss this, but in general, contributions are very welcome.
@Char-Aznable We decided that we are interested in making non-power-of-2 team sizes work, see #4146. Any help in implementing that is very welcome!
Great! I'll take a look at the code and see what I can do.