-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide valid default team size for SYCL #4481
Provide valid default team size for SYCL #4481
Conversation
} | ||
|
||
template <class FunctorType> | ||
int internal_team_size_recommended_reduce(const FunctorType& f) const { | ||
// FIXME_SYCL improve | ||
return internal_team_size_max_reduce(f); | ||
const int max_team_size_half = internal_team_size_max_reduce(f) / 2; | ||
int power_of_two = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is this trick which makes the power of 2 calculation O(1) and no jumps: https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using Kokkos::log2
instead (which uses intrinsics where available).
1e7f4ea
to
09f0d8f
Compare
09f0d8f
to
e498087
Compare
The test failure is unrelated. |
@@ -812,7 +816,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>, | |||
Kokkos::Impl::throw_runtime_exception(out.str()); | |||
} | |||
|
|||
if (m_team_size > m_policy.team_size_max(m_functor, ParallelForTag{})) | |||
if (m_team_size > m_policy.team_size_max(m_functor, ParallelReduceTag{})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yuk
core/unit_test/TestLocalDeepCopy.hpp
Outdated
#if defined(KOKKOS_ENABLE_SYCL) && !defined(KOKKOS_ARCH_INTEL_GPU) | ||
if (std::is_same<ExecSpace, Kokkos::Experimental::SYCL>::value) | ||
policy = team_policy(N, 512); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are failing because of the change team_size = {32 -> recommended_size}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think I'm requesting too many registers here on V100 but it seems to work fine on Intel GPUs
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change this? Why isn't the change you did above for defaulted team_size not giving you a valid large number (instead of 32) here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calculations for default/maximum team size don't take registers into account (since this information is in general not available through the SYCL
interface). There are multiple places where we need to use a smaller workgroup size for the SYCL+CUDA CI (where it's not an issue for Intel GPUs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we then be generally more careful with the max size return? I.e. limit ourself to 256 threads if building for NVIDIA GPUs for example? That would mean that it will always fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll try that.
core/unit_test/TestLocalDeepCopy.hpp
Outdated
#if defined(KOKKOS_ENABLE_SYCL) && !defined(KOKKOS_ARCH_INTEL_GPU) | ||
if (std::is_same<ExecSpace, Kokkos::Experimental::SYCL>::value) | ||
policy = team_policy(N, 512); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change this? Why isn't the change you did above for defaulted team_size not giving you a valid large number (instead of 32) here too?
max_threads_for_memory, 256}) / | ||
impl_vector_length(); | ||
|
||
#else | ||
return std::min<int>( | ||
m_space.impl_internal_space_instance()->m_maxWorkgroupSize, | ||
max_threads_for_memory) / | ||
impl_vector_length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if you did
return std::min<int>(
{m_space.impl_internal_space_instance()->m_maxWorkgroupSize,
// FIXME_SYCL Avoid requesting to many registers on NVIDIA GPUs.
#if <ARCH IS NVIDIA GPU>
256,
#endif
max_threads_for_memory}) /
impl_vector_length();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
a818af0
to
6673db9
Compare
6673db9
to
befa4ca
Compare
Retest this please. |
Retest this please. |
Looking at the
KokkosKernels
tests, I ran across a case where the default team_size of 32 would exceed the maximum group size eligible for the kernel. This pull request usesteam_size_recommended
instead and this, in turn, is computed as the maximum power of two not exceedingteam_size_max
.The computation of both
team_size_recommended
andteam_size_max
probably is still not ideal but certainly an improvement. Any help or suggestions are of course welcome.