Add checks for shmem usage in parallel_reduce #4548
Conversation
Can we make the test a general test for all backends? I.e., does this work with HIP/SYCL (have they actually implemented array reductions?)
Array reductions are implemented for
I only looked into the existing array reduction size limits in
I am not sure that there is a completely generic way of figuring out the max shmem size. We would probably need a function to which you hand the execution space instance, and then overloads for it in the test where you in fact use semi-magic numbers, or low-level backend-specific functions to figure it out (i.e., call raw CUDA/HIP/SYCL functionality).
Please fix:
The check will throw if the expected size of the reduced view exceeds the internal shmem limit.
@crtrott, as we discussed before, I will open another issue to convert this CUDA unit test to a general test for other backends once this is merged.
This is to resolve #4461
For the CUDA build, the problem comes from ParallelReduce, where it determines the necessary scratch space size and the block size for the reduced view. Starting at a reduced view of 181 doubles, the calculated block size drops from 32 to 16, which seems to cause a CUDA illegal memory access. Interestingly, the ParallelReduce that takes a TeamPolicy already has a similar check that uses the team size instead of the block size to verify the same condition. So, to keep the conditions of the throws as consistent as possible across the policies, this commit adds a simple function that checks whether the calculated block size would drop below 32 because of the internal maximum shared memory size per block.
For the HIP build, the calculated block size drops to 0 starting at 125 doubles. HIP::ParallelReduce already has a check that throws if the calculated block size becomes 0, which is what was observed in the original issue post.