Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap vector size to kernel maximum for SYCL #4704

Merged
merged 2 commits into from
Jan 26, 2022

Conversation

masterleinad
Copy link
Contributor

Fixes #4573.
For SYCL, the maximum vector size depends on the kernel and the maximum reported by the device might not be supported for every kernel. So far, we just errored out in case we detected the vector size being too large. This pull request tries to cap the vector size to the maximum supported by the kernel instead.
This is easy enough for parallel_for since the kernel doesn't use the vector size explicitly (meaning that it's fine to retrieve from within the kernel) but for parallel_reduce allocations for the reductions actually depend on the total workgroup size.
The idea here is to first build a kernel with dummy pointers for reduction results and temporary storage, use that one for querying the maximum vector size for the kernel, allocating temporary global and local space, and create the final kernel to be launched with these pointers.

The actual changes are not that large if you ignore whitespace changes.

@brian-kelley
Copy link
Contributor

@masterleinad So from the discussion, this does work for all parallel_fors in KokkosKernels which are valid Kokkos.

@masterleinad masterleinad added the Blocks Promotion Overview issue for release-blocking bugs label Jan 21, 2022
Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive by comment

std::stringstream out;
out << "The maximum subgroup size (" << max_sg_size
<< ") for this kernel is not divisible by the vector_size ("
<< m_vector_size << "). Choose a smaller vector_size!\n";
<< final_vector_size << "). Choose a smaller vector_size!\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message needs to be revisited

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why this needs to be revisited? Should we explain that the vector size might have been capped?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message will only be triggered if m_vector_size was smaller than max_sg_size in the first place (since you use the min). So one doesn't need to necessarily choose a smaller vector size, just something which is divisible (which could be larger). That said, how would this be ever triggered? Is max_sg_size not a power of two? We kinda have that requirement for vector length in team policy.

std::stringstream out;
out << "The maximum subgroup size (" << max_sg_size
<< ") for this kernel is not divisible by the vector_size ("
<< m_vector_size << "). Choose a smaller vector_size!\n";
<< final_vector_size << "). Choose a smaller vector_size!\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message will only be triggered if m_vector_size was smaller than max_sg_size in the first place (since you use the min). So one doesn't need to necessarily choose a smaller vector size, just something which is divisible (which could be larger). That said, how would this be ever triggered? Is max_sg_size not a power of two? We kinda have that requirement for vector length in team policy.

std::stringstream out;
out << "The maximum subgroup size (" << max_sg_size
<< ") for this kernel is not divisible by the vector_size ("
<< m_vector_size << "). Choose a smaller vector_size!\n";
<< final_vector_size << "). Choose a smaller vector_size!\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the used vector length is the largest power-of-two not exceeding the input vector length. The SYCL standard doesn't specify that the possible subgroup sizes are powers-of-two but that at least seems to be true for NVIDIA and Intel GPUs.
I think we can remove the check.

@dalg24
Copy link
Member

dalg24 commented Jan 26, 2022

Failure (compiler crashing in OpenMPTarget build on Nvidia GPU) is clearly unrelated

@dalg24 dalg24 merged commit 1ec0d1e into kokkos:develop Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocks Promotion Overview issue for release-blocking bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants