Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix incorrect offset in CUDA/HIP parallel scan for < 4 byte types #5555

Conversation

nmm0
Copy link
Contributor

@nmm0 nmm0 commented Oct 13, 2022

Fixes #5545

@nmm0
Copy link
Contributor Author

nmm0 commented Oct 13, 2022

@cwsmith this should solve the issue though there is additional work to be done in ensuring the correct type is used for the internal parallel scan when using `Kokkos::Experimental::inclusive_scan"

@nmm0 nmm0 requested a review from crtrott October 13, 2022 19:39
@@ -1026,7 +1061,8 @@ class ParallelScanWithTotal<FunctorType, Kokkos::RangePolicy<Traits...>,
if (!m_result_ptr_device_accessible)
DeepCopy<HostSpace, CudaSpace, Cuda>(
m_policy.space(), m_result_ptr,
m_scratch_space + (grid_x - 1) * size / sizeof(int), size);
Copy link
Contributor Author

@nmm0 nmm0 Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually know where int came from in the original, but sizeof(word_size_type) is important here otherwise the offset of the result in the shared memory buffer will be incorrect (i.e. 0)

Copy link
Contributor

@PhilMiller PhilMiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does any corresponding change need to be done for HIP?

@masterleinad masterleinad added Blocks Promotion Overview issue for release-blocking bugs Patch Release labels Oct 13, 2022
@nmm0
Copy link
Contributor Author

nmm0 commented Oct 13, 2022

Does any corresponding change need to be done for HIP?

From looking at the code it certainly appears so -- in fact the logic bug (size < HIP::size_type) appears to also be present in HIP's reduce

@masterleinad
Copy link
Contributor

I think we should make the same changes for HIP.

@nmm0
Copy link
Contributor Author

nmm0 commented Oct 13, 2022

I think we should make the same changes for HIP.

Filed #5556

@crtrott
Copy link
Member

crtrott commented Oct 14, 2022

Yeah lets get the HIP thing fixed here too, since the test doesn't pass anyway.

Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please beef up the description of the PR and relate it to #4156

Comment on lines 204 to 214
// We should have a nice count from 0 to 1...
if (update != static_cast<value_type>(i)) {
int fail = errors()++;

// Limit the amount of output
if (fail < 20) {
KOKKOS_IMPL_DO_NOT_USE_PRINTF(
"TestSmallSizeTypeScan(%d) = %ld != %ld\n", i,
static_cast<long>(update), static_cast<long>(i));
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not mix computation and verification. Launch another kernel for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, should I remove from the existing test?

core/unit_test/TestScan.hpp Outdated Show resolved Hide resolved
@ajpowelsnl ajpowelsnl mentioned this pull request Oct 17, 2022
@Rombur
Copy link
Member

Rombur commented Oct 21, 2022

I have a patch that fixes the problem with HIP. How should we proceed?

@dalg24
Copy link
Member

dalg24 commented Oct 21, 2022

Please go ahead and push to this PR

core/unit_test/TestScan.hpp Outdated Show resolved Hide resolved
Comment on lines 219 to 221
KOKKOS_INLINE_FUNCTION
void init(value_type& update) const { update = 0; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
KOKKOS_INLINE_FUNCTION
void init(value_type& update) const { update = 0; }

core/unit_test/TestScan.hpp Outdated Show resolved Hide resolved
core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp Show resolved Hide resolved
core/unit_test/TestScan.hpp Outdated Show resolved Hide resolved
core/unit_test/TestScan.hpp Outdated Show resolved Hide resolved
core/unit_test/TestScan.hpp Outdated Show resolved Hide resolved
@dalg24
Copy link
Member

dalg24 commented Oct 31, 2022

The HPX build is now passing. On the ORNL Jenkins CI server, there is one (likely unrelated) CUDA timing-based failure which I am willing to ignore.
We need one more approval though.

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good to me.

Copy link
Contributor

@ldh4 ldh4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just nitpicking on one thing.

core/src/Cuda/Kokkos_Cuda_Parallel_Range.hpp Outdated Show resolved Hide resolved
@dalg24 dalg24 merged commit 3c842fb into kokkos:develop Oct 31, 2022
dalg24 added a commit to dalg24/kokkos that referenced this pull request Oct 31, 2022
…#5555)

* kokkos#5545: tests: add reproducer test for kokkos#5545

* kokkos#5545: cuda: use word_size_type for ParallelScan

* kokkos#5545: tests: add reproducer for offset issue with ParallelScanWithTotal

* kokkos#5545: cuda: add word_size_type for ParallelScanWithTotal

* fix formatting

* cuda: fix comments for word_size_type

* tests: fix mismatched types in Scan test assert

* Fix HIP parallel_scan when using types < 4 bytes

* Fix indentation

* Replace class with typename

* tests: re-use old scan test for small types; add extra parameter to avoid overflow

* tests: remove fence and change ImbalanceSz to value_type

* Fix format

* Re-introduce size_type member type in ParallelScan[WithTotal] CUDA specializations

Applied review suggestion to reduce the diff

* Nitpick in HIP too prefere size_type alias to HIP::size_type in definition of word_size_type

* Apply Dongs suggestion and revert one more change I missed

Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Co-authored-by: Damien L-G <dalg24@gmail.com>
dalg24 added a commit that referenced this pull request Nov 2, 2022
#5607)

* fix incorrect offset in cuda parallel scan for < 4 byte types (#5555)

* #5545: tests: add reproducer test for #5545

* #5545: cuda: use word_size_type for ParallelScan

* #5545: tests: add reproducer for offset issue with ParallelScanWithTotal

* #5545: cuda: add word_size_type for ParallelScanWithTotal

* fix formatting

* cuda: fix comments for word_size_type

* tests: fix mismatched types in Scan test assert

* Fix HIP parallel_scan when using types < 4 bytes

* Fix indentation

* Replace class with typename

* tests: re-use old scan test for small types; add extra parameter to avoid overflow

* tests: remove fence and change ImbalanceSz to value_type

* Fix format

* Re-introduce size_type member type in ParallelScan[WithTotal] CUDA specializations

Applied review suggestion to reduce the diff

* Nitpick in HIP too prefere size_type alias to HIP::size_type in definition of word_size_type

* Apply Dongs suggestion and revert one more change I missed

Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Co-authored-by: Damien L-G <dalg24@gmail.com>

* Remove stray comment for format accident

* Fix bug spotted by Nic

* cuda: fix missing value_type

* HIP: fix missing value_type alias from cherry-pick

Co-authored-by: Nicolas Morales <nmmoral@sandia.gov>
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
cz4rs pushed a commit to cz4rs/kokkos that referenced this pull request Nov 10, 2022
…#5555)

* kokkos#5545: tests: add reproducer test for kokkos#5545

* kokkos#5545: cuda: use word_size_type for ParallelScan

* kokkos#5545: tests: add reproducer for offset issue with ParallelScanWithTotal

* kokkos#5545: cuda: add word_size_type for ParallelScanWithTotal

* fix formatting

* cuda: fix comments for word_size_type

* tests: fix mismatched types in Scan test assert

* Fix HIP parallel_scan when using types < 4 bytes

* Fix indentation

* Replace class with typename

* tests: re-use old scan test for small types; add extra parameter to avoid overflow

* tests: remove fence and change ImbalanceSz to value_type

* Fix format

* Re-introduce size_type member type in ParallelScan[WithTotal] CUDA specializations

Applied review suggestion to reduce the diff

* Nitpick in HIP too prefere size_type alias to HIP::size_type in definition of word_size_type

* Apply Dongs suggestion and revert one more change I missed

Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Co-authored-by: Damien L-G <dalg24@gmail.com>
tcclevenger pushed a commit to tcclevenger/kokkos that referenced this pull request Dec 5, 2022
…#5555)

* kokkos#5545: tests: add reproducer test for kokkos#5545

* kokkos#5545: cuda: use word_size_type for ParallelScan

* kokkos#5545: tests: add reproducer for offset issue with ParallelScanWithTotal

* kokkos#5545: cuda: add word_size_type for ParallelScanWithTotal

* fix formatting

* cuda: fix comments for word_size_type

* tests: fix mismatched types in Scan test assert

* Fix HIP parallel_scan when using types < 4 bytes

* Fix indentation

* Replace class with typename

* tests: re-use old scan test for small types; add extra parameter to avoid overflow

* tests: remove fence and change ImbalanceSz to value_type

* Fix format

* Re-introduce size_type member type in ParallelScan[WithTotal] CUDA specializations

Applied review suggestion to reduce the diff

* Nitpick in HIP too prefere size_type alias to HIP::size_type in definition of word_size_type

* Apply Dongs suggestion and revert one more change I missed

Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Co-authored-by: Damien L-G <dalg24@gmail.com>
@PhilMiller PhilMiller changed the title fix incorrect offset in cuda parallel scan for < 4 byte types fix incorrect offset in CUDA/HIP parallel scan for < 4 byte types Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocks Promotion Overview issue for release-blocking bugs Patch Release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants