Provide valid default team size for SYCL #4481

Merged: 12 commits into kokkos:develop on Dec 15, 2021

Conversation

masterleinad (Contributor):

Looking at the KokkosKernels tests, I ran across a case where the default team_size of 32 would exceed the maximum group size eligible for the kernel. This pull request uses team_size_recommended instead, which in turn is computed as the maximum power of two not exceeding team_size_max.
The computation of both team_size_recommended and team_size_max is probably still not ideal, but this is certainly an improvement. Any help or suggestions are of course welcome.
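
For illustration only (a sketch, not code from this pull request; ExampleFunctor and the policy setup below are assumptions), this is the kind of query a defaulted team size should fall back to instead of a hard-coded 32:

    #include <Kokkos_Core.hpp>

    struct ExampleFunctor {  // assumed example functor
      KOKKOS_FUNCTION void operator()(
          const Kokkos::TeamPolicy<>::member_type& /*team*/) const {}
    };

    int main(int argc, char* argv[]) {
      Kokkos::ScopeGuard guard(argc, argv);
      ExampleFunctor f;
      Kokkos::TeamPolicy<> policy(/*league_size=*/100, Kokkos::AUTO);
      // Ask the backend instead of assuming 32 threads per team.
      const int recommended =
          policy.team_size_recommended(f, Kokkos::ParallelForTag{});
      const int maximum = policy.team_size_max(f, Kokkos::ParallelForTag{});
      // recommended never exceeds maximum, so this launch is always valid.
      Kokkos::parallel_for(Kokkos::TeamPolicy<>(100, recommended), f);
      (void)maximum;
      return 0;
    }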

@masterleinad marked this pull request as draft on October 29, 2021.
}

template <class FunctorType>
int internal_team_size_recommended_reduce(const FunctorType& f) const {
  // FIXME_SYCL improve
  return internal_team_size_max_reduce(f);
  const int max_team_size_half = internal_team_size_max_reduce(f) / 2;
  int power_of_two = 1;
Contributor:

There is a trick that makes the power-of-two calculation O(1) with no branches: https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2

masterleinad (Contributor, author):

I'm using Kokkos::log2 instead (which uses intrinsics where available).
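
As a standalone sketch of the two routes discussed here: the linked bithacks entry rounds up to the next power of two, while the recommended team size needs the largest power of two not exceeding team_size_max. floor_pow2 and floor_pow2_log2 below are purely illustrative helpers, not Kokkos functions; the merged code relies on Kokkos::log2.

    #include <cassert>
    #include <cstdint>

    // Largest power of two not exceeding n (n >= 1): smear the highest set bit
    // downward, then keep only that bit.
    std::uint32_t floor_pow2(std::uint32_t n) {
      n |= n >> 1;
      n |= n >> 2;
      n |= n >> 4;
      n |= n >> 8;
      n |= n >> 16;
      return n - (n >> 1);
    }

    // Same result via an integer log2, analogous to using Kokkos::log2
    // (intrinsics such as __builtin_clz would avoid the loop).
    std::uint32_t floor_pow2_log2(std::uint32_t n) {
      std::uint32_t log = 0;
      while (n >>= 1) ++log;
      return std::uint32_t{1} << log;
    }

    int main() {
      // e.g. a team_size_max of 48 yields a recommended team size of 32
      assert(floor_pow2(48) == 32);
      assert(floor_pow2_log2(48) == 32);
      assert(floor_pow2(64) == 64);
      return 0;
    }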

@masterleinad force-pushed the fix_default_team_size_sycl branch 2 times, most recently from 1e7f4ea to 09f0d8f on November 10, 2021.
masterleinad (Contributor, author):

The test failure is unrelated.

@masterleinad marked this pull request as ready for review on November 23, 2021.
@@ -812,7 +816,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
     Kokkos::Impl::throw_runtime_exception(out.str());
   }

-  if (m_team_size > m_policy.team_size_max(m_functor, ParallelForTag{}))
+  if (m_team_size > m_policy.team_size_max(m_functor, ParallelReduceTag{}))
Member:

yuk

core/src/SYCL/Kokkos_SYCL_Parallel_Team.hpp (outdated review thread, resolved)
#if defined(KOKKOS_ENABLE_SYCL) && !defined(KOKKOS_ARCH_INTEL_GPU)
  if (std::is_same<ExecSpace, Kokkos::Experimental::SYCL>::value)
    policy = team_policy(N, 512);
#endif
Member:

These are failing because of the change team_size = {32 -> recommended_size}?

masterleinad (Contributor, author):

Yes, I think I'm requesting too many registers here on V100 but it seems to work fine on Intel GPUs...

Member:

Why did you change this? Why isn't the change you made above for the defaulted team_size giving you a valid large number (instead of 32) here too?

masterleinad (Contributor, author):

The calculations for the default/maximum team size don't take registers into account (since this information is in general not available through the SYCL interface). There are multiple places where we need to use a smaller workgroup size for the SYCL+CUDA CI, whereas it's not an issue on Intel GPUs.

Member:

Shouldn't we then be generally more careful with the maximum size we return? I.e., limit ourselves to 256 threads when building for NVIDIA GPUs, for example? That would mean it always fits.

masterleinad (Contributor, author):

OK, I'll try that.

                 max_threads_for_memory, 256}) /
             impl_vector_length();

#else
    return std::min<int>(
               m_space.impl_internal_space_instance()->m_maxWorkgroupSize,
               max_threads_for_memory) /
           impl_vector_length();
dalg24 (Member), Dec 10, 2021:

I would prefer if you did

    return std::min<int>(
                {m_space.impl_internal_space_instance()->m_maxWorkgroupSize,
// FIXME_SYCL Avoid requesting too many registers on NVIDIA GPUs.
#if <ARCH IS NVIDIA GPU>
                 256,
#endif
                 max_threads_for_memory}) /
            impl_vector_length();

masterleinad (Contributor, author):

Fixed.
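
A minimal sketch of the resulting shape, assuming a hypothetical KOKKOS_IMPL_NVIDIA_GPU_SKETCH macro stands in for whatever architecture guard the merged code actually uses, and a free function stands in for the surrounding team_size_max member:

    #include <algorithm>

    // Sketch only: guarding the 256 cap with an architecture check keeps
    // Intel GPUs unaffected while SYCL+CUDA launches stay within register limits.
    int sketch_team_size_max(int max_workgroup_size, int max_threads_for_memory,
                             int vector_length) {
      return std::min<int>({max_workgroup_size,
    // FIXME_SYCL Avoid requesting too many registers on NVIDIA GPUs.
    #ifdef KOKKOS_IMPL_NVIDIA_GPU_SKETCH
                            256,
    #endif
                            max_threads_for_memory}) /
             vector_length;
    }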

masterleinad (Contributor, author):

Retest this please.

masterleinad (Contributor, author):

Retest this please.

@dalg24 merged commit 1dbc97a into kokkos:develop on Dec 15, 2021.