fix incorrect offset in CUDA/HIP parallel scan for < 4 byte types #5555
@cwsmith this should solve the issue, though there is additional work to be done in ensuring the correct type is used for the internal parallel scan when using `Kokkos::Experimental::inclusive_scan`.
@@ -1026,7 +1061,8 @@ class ParallelScanWithTotal<FunctorType, Kokkos::RangePolicy<Traits...>,
      if (!m_result_ptr_device_accessible)
        DeepCopy<HostSpace, CudaSpace, Cuda>(
            m_policy.space(), m_result_ptr,
            m_scratch_space + (grid_x - 1) * size / sizeof(int), size);
I don't actually know where `int` came from in the original, but `sizeof(word_size_type)` is important here; otherwise the offset of the result in the shared memory buffer will be incorrect (i.e. 0).
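To make the arithmetic concrete, here is a minimal host-only sketch of the offset computation. It is not the Kokkos source: `word_type_for`, `byte_offset_reached`, and `byte_offset_wanted` are hypothetical names, and the rule for picking a 1-, 2-, or 4-byte word is an assumption about what `word_size_type` is meant to do. The point is only that dividing by `sizeof(int)` truncates for value types smaller than 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the word type selection (an assumption, not the
// Kokkos definition): pick a word no wider than the value type so that
// size / sizeof(word) never truncates to 0.
template <class ValueType>
using word_type_for = std::conditional_t<
    sizeof(ValueType) < sizeof(std::uint32_t),
    std::conditional_t<sizeof(ValueType) == 2, std::int16_t, std::int8_t>,
    std::uint32_t>;

// Byte offset actually reached when a pointer to Word is advanced by
// (grid_x - 1) * size / sizeof(Word) elements, using integer division.
template <class Word>
constexpr std::size_t byte_offset_reached(std::size_t grid_x,
                                          std::size_t size) {
  return ((grid_x - 1) * size / sizeof(Word)) * sizeof(Word);
}

// Byte offset at which the last block's result is stored.
constexpr std::size_t byte_offset_wanted(std::size_t grid_x,
                                         std::size_t size) {
  return (grid_x - 1) * size;
}

// Walk through the 2-byte case with two blocks.
inline void check_small_type_offsets() {
  constexpr std::size_t size = sizeof(std::int16_t);
  // Dividing by sizeof(int): 1 * 2 / 4 truncates to 0, so the copy reads
  // the first block's slot instead of the last one.
  assert(byte_offset_reached<int>(2, size) == 0);
  // Dividing by the 2-byte word size lands on the intended slot.
  assert(byte_offset_reached<word_type_for<std::int16_t>>(2, size) ==
         byte_offset_wanted(2, size));
}
```

For 4-byte and larger value types both divisors give the same answer in this sketch, which would be consistent with the bug only showing up for types smaller than 4 bytes.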
Does any corresponding change need to be done for HIP?
From looking at the code it certainly appears so -- in fact the logic bug (size < |
I think we should make the same changes for |
Filed #5556
Yeah, let's get the HIP thing fixed here too, since the test doesn't pass anyway.
Please beef up the description of the PR and relate it to #4156
core/unit_test/TestScan.hpp
Outdated
      // We should have a nice count from 0 to 1...
      if (update != static_cast<value_type>(i)) {
        int fail = errors()++;

        // Limit the amount of output
        if (fail < 20) {
          KOKKOS_IMPL_DO_NOT_USE_PRINTF(
              "TestSmallSizeTypeScan(%d) = %ld != %ld\n", i,
              static_cast<long>(update), static_cast<long>(i));
        }
      }
Do not mix computation and verification. Launch another kernel for that.
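One way to read this request, sketched in plain host C++ rather than with Kokkos kernels: the scan pass only writes results, and a second pass only checks them. The names `scan_ones` and `count_errors` are hypothetical and do not come from `TestScan.hpp`; in the real test each pass would be its own kernel launch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1: computation only. Inclusive scan over the constant 1, stored in a
// (possibly small) value type. The caller keeps n small enough not to
// overflow ValueType.
template <class ValueType>
std::vector<ValueType> scan_ones(std::size_t n) {
  std::vector<ValueType> out(n);
  ValueType update = 0;
  for (std::size_t i = 0; i < n; ++i) {
    update = static_cast<ValueType>(update + 1);
    out[i] = update;
  }
  return out;
}

// Pass 2: verification only. Counts entries that differ from the expected
// inclusive-scan value i + 1.
template <class ValueType>
int count_errors(const std::vector<ValueType>& result) {
  int errors = 0;
  for (std::size_t i = 0; i < result.size(); ++i) {
    if (result[i] != static_cast<ValueType>(i + 1)) ++errors;
  }
  return errors;
}
```

Keeping the check out of the scan body means a failure report cannot perturb the computation being tested, and the checker can be reused for every value type.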
Similarly, should I remove from the existing test?
I have a patch that fixes the problem with HIP. How should we proceed?
Please go ahead and push to this PR.
core/unit_test/TestScan.hpp
Outdated
  KOKKOS_INLINE_FUNCTION
  void init(value_type& update) const { update = 0; }
  KOKKOS_INLINE_FUNCTION
  void init(value_type& update) const { update = 0; }
The HPX build is now passing. On the ORNL Jenkins CI server, there is one (likely unrelated) CUDA timing-based failure which I am willing to ignore.
…ecializations Applied review suggestion to reduce the diff
…ition of word_size_type
Still looks good to me.
Looks good to me. Just nitpicking on one thing.
…#5555)
* kokkos#5545: tests: add reproducer test for kokkos#5545
* kokkos#5545: cuda: use word_size_type for ParallelScan
* kokkos#5545: tests: add reproducer for offset issue with ParallelScanWithTotal
* kokkos#5545: cuda: add word_size_type for ParallelScanWithTotal
* fix formatting
* cuda: fix comments for word_size_type
* tests: fix mismatched types in Scan test assert
* Fix HIP parallel_scan when using types < 4 bytes
* Fix indentation
* Replace class with typename
* tests: re-use old scan test for small types; add extra parameter to avoid overflow
* tests: remove fence and change ImbalanceSz to value_type
* Fix format
* Re-introduce size_type member type in ParallelScan[WithTotal] CUDA specializations
  Applied review suggestion to reduce the diff
* Nitpick in HIP too
  prefere size_type alias to HIP::size_type in definition of word_size_type
* Apply Dongs suggestion and revert one more change I missed
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Co-authored-by: Damien L-G <dalg24@gmail.com>
#5607)
* fix incorrect offset in cuda parallel scan for < 4 byte types (#5555)
  * #5545: tests: add reproducer test for #5545
  * #5545: cuda: use word_size_type for ParallelScan
  * #5545: tests: add reproducer for offset issue with ParallelScanWithTotal
  * #5545: cuda: add word_size_type for ParallelScanWithTotal
  * fix formatting
  * cuda: fix comments for word_size_type
  * tests: fix mismatched types in Scan test assert
  * Fix HIP parallel_scan when using types < 4 bytes
  * Fix indentation
  * Replace class with typename
  * tests: re-use old scan test for small types; add extra parameter to avoid overflow
  * tests: remove fence and change ImbalanceSz to value_type
  * Fix format
  * Re-introduce size_type member type in ParallelScan[WithTotal] CUDA specializations
    Applied review suggestion to reduce the diff
  * Nitpick in HIP too
    prefere size_type alias to HIP::size_type in definition of word_size_type
  * Apply Dongs suggestion and revert one more change I missed
  Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
  Co-authored-by: Damien L-G <dalg24+github@gmail.com>
  Co-authored-by: Damien L-G <dalg24@gmail.com>
* Remove stray comment for format accident
* Fix bug spotted by Nic
* cuda: fix missing value_type
* HIP: fix missing value_type alias from cherry-pick
Co-authored-by: Nicolas Morales <nmmoral@sandia.gov>
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Fixes #5545