Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][CUDA] Fix alignment of local arguments #5113

Merged
merged 3 commits into from
Jan 10, 2022

Conversation

npmiller
Copy link
Contributor

@npmiller npmiller commented Dec 9, 2021

The issue there is that for local kernel argument the CUDA plugin uses
CUDA dynamic shared memory, which gives us a single chunk of shared
memory to work with.

The CUDA plugin then lays out all the local kernel arguments
consecutively in this single chunk of memory.

And this can cause issues because simply laying the arguments out one
after the other can result in misaligned arguments.

So this patch is changing the argument layout to align them to the
maximum necessary alignment which is the size of the largest vector
type. Additionally if there is a local buffer smaller than this maximum
alignment, the size of that buffer is simply used for alignment.

This addresses #5007 for CUDA backend.

See also the discussion on #5104 for alternative solution, that may be
more efficient but would require a more intrusive ABI changing patch.

The issue there is that for local kernel argument the CUDA plugin uses
CUDA dynamic shared memory, which gives us a single chunk of shared
memory to work with.

The CUDA plugin then lays out all the local kernel arguments
consecutively in this single chunk of memory.

And this can cause issues because simply laying the arguments out one
after the other can result in misaligned arguments.

So this patch is changing the argument layout to align them to the
maximum necessary alignment which is the size of the largest vector
type. Additionally if there is a local buffer smaller than this maximum
alignment, the size of that buffer is simply used for alignment.

This fixes the issue in intel#5007.

See also the discussion on intel#5104 for alternative solution, that may be
more efficient but would require a more intrusive ABI changing patch.
@npmiller npmiller requested a review from a team as a code owner December 9, 2021 15:44
@zjin-lcf
Copy link
Contributor

zjin-lcf commented Dec 9, 2021

Is more shared local memory needed (the total size of the memory block) with aligned offset ?

@npmiller
Copy link
Contributor Author

npmiller commented Dec 9, 2021

Is more shared local memory needed (the total size of the memory block) with aligned offset ?

Good point, I forgot to update the size of the argument when adding the padding, should be good now.

@romanovvlad
Copy link
Contributor

LGTM, can we have a test for that? Cannot think of how such a test could be written though.

@npmiller
Copy link
Contributor Author

LGTM, can we have a test for that? Cannot think of how such a test could be written though.

I'll look into setting up a test for that, it should be fairly straightforward to check the alignment of the local argument addresses, it'll have to go in llvm-test-suite though.

npmiller added a commit to npmiller/llvm-test-suite that referenced this pull request Dec 10, 2021
This issue was solved in intel/llvm#5113, local kernel arguments have to
be aligned to the type size.
@zjin-lcf
Copy link
Contributor

@npmiller
What about a test for this: even if the two arguments are simply laid out consecutively, the double4 argument will still be correctly aligned ?

// Manually capture kernel arguments to ensure an order with the int
// argument first and the double4 argument second. If the two arguments are
// simply laid out consecutively, the double4 argument will not be
// correctly aligned.

@npmiller
Copy link
Contributor Author

I'm not sure I understand what you mean, when I say simply laid out consecutively I just mean if we don't add padding between the int and the double4 and just put one after the other in memory. The test I added should fail in that case because the double4 will not be aligned properly.

@zjin-lcf
Copy link
Contributor

With your pull request, users won't need to change their source code or be concerned about alignment. So I thought that a test might be added that does not explicitly specify the order. Is that right ?

@npmiller
Copy link
Contributor Author

Oh I see what you mean now. I'm not sure if that's necessary, the problem wasn't with the arguments capture, just the arguments order, so implicit or explicit capture doesn't really matter here. Except that specifying the arguments explicitly is just a clearer way to make sure the test reproduces an argument order that would trigger the bug. It is artificially making the worse argument order implicit capture could do.

@zjin-lcf
Copy link
Contributor

Thanks

@zjin-lcf
Copy link
Contributor

@bader
I installed the compiler today and then found that the PR is blocking. Is the blocked merging caused by the development of other features ? A general question is the average number of days a PR can be added to the main repository.

Thanks

@bader
Copy link
Contributor

bader commented Dec 20, 2021

Is the blocked merging caused by the development of other features ?

Are we talking about this particular PR? If so, it looks like this PR is blocked by missing approval from code owners.
I suggest pinging them - they might just missed that tests requested by @romanovvlad are added via a pull request for other repository.

@intel/llvm-reviewers-cuda, please, take a look.

@zjin-lcf
Copy link
Contributor

Yes. I was referring to the two PRs about local memory alignment/accessor.
I am not sure if code review is similar to paper review in terms of waiting time. I appreciate the reviewers' time and timeliness.

@keryell
Copy link
Contributor

keryell commented Dec 20, 2021

I am not sure if code review is similar to paper review in terms of waiting time.

This is mean! ;-)

steffenlarsen
steffenlarsen previously approved these changes Dec 21, 2021
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Sorry for the delay in review.

Would it be possible to add a regression test for this?

@bader
Copy link
Contributor

bader commented Dec 21, 2021

/verify with intel/llvm-test-suite#608

@steffenlarsen, does intel/llvm-test-suite#608 look good to you?

@steffenlarsen
Copy link
Contributor

@steffenlarsen, does intel/llvm-test-suite#608 look good to you?

Completely missed the link. Yes that test looks good. 😄

@bader
Copy link
Contributor

bader commented Dec 21, 2021

/verify with intel/llvm-test-suite#608

@bader
Copy link
Contributor

bader commented Dec 21, 2021

It looks like SYCL-Unit::CudaKernelsTest.PIKernelArgumentSetTwiceOneLocal test requires some updates with this change.

@npmiller
Copy link
Contributor Author

npmiller commented Jan 3, 2022

I've tweaked the alignment code a little bit so that it only adds padding if necessary, the original patch was a bit inefficient as it would still add the alignment even if the address was already aligned. Now it will only add some padding if needed.

This fixes SYCL-Unit::CudaKernelsTest.PIKernelArgumentSetTwiceOneLocal, because that test was expecting 0 for the one local argument it uses and this patch was turning it into 4, by adding extra unnecessary padding.

@bader bader merged commit ebb1281 into intel:sycl Jan 10, 2022
bader pushed a commit that referenced this pull request Jan 20, 2022
Now, that it lives in `SYCLLowerIR` it can be easily shared between AMDGCN and NVPTX backends.

This requires the same alignment fix as for Cuda, see: #5113

Fixes #5013
bader pushed a commit to intel/llvm-test-suite that referenced this pull request Mar 4, 2022
This issue was solved in intel/llvm#5113, local kernel arguments have to
be aligned to the type size.
aelovikov-intel pushed a commit that referenced this pull request Feb 17, 2023
This issue was solved in #5113, local kernel arguments have to
be aligned to the type size.
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…test-suite#608)

This issue was solved in intel#5113, local kernel arguments have to
be aligned to the type size.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants