[SYCL][CUDA][HIP] warp misaligned address on CUDA and results mismatch on HIP #5007

Closed
zjin-lcf opened this issue Nov 22, 2021 · 19 comments
Labels
bug (Something isn't working), cuda (CUDA back-end), hip (Issues related to execution on HIP backend), runtime (Runtime library related issue)

Comments

@zjin-lcf
Contributor

zjin-lcf commented Nov 22, 2021

Running the example https://github.com/zjin-lcf/HeCBench/blob/master/aop-sycl/main.cpp built with CUDA support on a P100 GPU
produces a "warp misaligned address" error, which may be caused by the shared local memory "double4 lsums" in the kernel prepare_svd_kernel<256, PayoffPut>. The SYCL program runs successfully on an Intel GPU.

Did you encounter a "warp misaligned address" error when porting a CUDA program?

Running the example built with HIP support shows that the result does not match the HIP/CUDA version:
To reproduce

make HIP=yes
./main

==============
Num Timesteps         : 100
Num Paths             : 32K
Num Runs              : 1
T                     : 1.000000
S0                    : 3.600000
K                     : 4.000000
r                     : 0.060000
sigma                 : 0.200000
Option Type           : American Put
==============
GPU Longstaff-Schwartz: 0.39776070   (expected: 0.44783124)
@zjin-lcf added the bug (Something isn't working) label Nov 22, 2021
@AerialMantis added the cuda (CUDA back-end), hip (Issues related to execution on HIP backend), and runtime (Runtime library related issue) labels Nov 23, 2021
@AerialMantis added this to Needs triage in oneAPI DPC++ Nov 24, 2021
@AerialMantis moved this from Needs triage to Selected in oneAPI DPC++ Nov 29, 2021
@npmiller
Contributor

npmiller commented Dec 2, 2021

Hello @zjin-lcf,

I've looked at this on CUDA and there is a bug with the way local kernel arguments are laid out in memory, which causes the double4 argument address to be misaligned.

I'm investigating a proper fix; however, as a workaround you can simply re-order the kernel arguments, for example as follows:

diff --git a/aop-sycl/main.cpp b/aop-sycl/main.cpp
index 6a9c9edc..2b8830f0 100644
--- a/aop-sycl/main.cpp
+++ b/aop-sycl/main.cpp
@@ -1046,7 +1046,7 @@ void do_run(queue &q,
     accessor<double4, 1, sycl_read_write, access::target::local> lsums (1, cgh);
     accessor<int, 1, sycl_read_write, access::target::local> lsum (1, cgh);
     accessor<double, 1, sycl_read_write, access::target::local> smem_svds (R_W_MATRICES_SMEM_SLOTS, cgh);
-    cgh.parallel_for<class prepare_svd<Payoff>>(nd_range<1>(gws_prepare_svd, lws_prepare_svd), [=] (nd_item<1> item) {
+    cgh.parallel_for<class prepare_svd<Payoff>>(nd_range<1>(gws_prepare_svd, lws_prepare_svd), [num_paths, payoff, paths, all_out_of_the_money, svds, lsums, smem_svds, scan_input, scan_output, lsum] (nd_item<1> item) {
       prepare_svd_kernel<NUM_THREADS_PER_BLOCK1>(
           item,
           num_paths,

The local memory arguments seem to be placed in a shared memory buffer one after the other, so with an int and a double4 you would have [int, double4] with no padding in between, which breaks the alignment. This patch just swaps them around to [double4, int], so the double4 gets the correct alignment; it still works for the int because it requires a smaller alignment.

I'll update this ticket when I have a proper fix for this.
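
To make the layout problem concrete, here is a small stand-alone sketch (illustrative only, not the plugin code; the 32-byte alignment is an assumption standing in for sycl::double4, which is aligned to its own size). Packing the arguments back-to-back leaves the double4 at offset 4, while padding the offset up to the required alignment fixes it:

#include <cstddef>
#include <cstdio>

// Stand-in for the double4 local argument and its alignment requirement.
struct alignas(32) double4_like { double x, y, z, w; };

int main() {
  std::size_t int_offset = 0;
  // Back-to-back packing: the double4 starts right after the int.
  std::size_t packed_offset = int_offset + sizeof(int);
  // Padded layout: round the offset up to the double4's alignment.
  std::size_t a = alignof(double4_like);
  std::size_t padded_offset = (packed_offset + a - 1) / a * a;
  std::printf("packed: offset %zu (%s), padded: offset %zu (%s)\n",
              packed_offset, packed_offset % a ? "misaligned" : "ok",
              padded_offset, padded_offset % a ? "misaligned" : "ok");
  return 0;
}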

@zjin-lcf
Contributor Author

zjin-lcf commented Dec 2, 2021

The function is called in the following way with the three pointers:

      lsums.get_pointer(),
      lsum.get_pointer(),
      smem_svds.get_pointer(),

Does the compiler have its own way to determine the order?
Does [...] override the order? Is it documented?

Thanks

@npmiller
Contributor

npmiller commented Dec 2, 2021

The actual "kernel" is the lambda of the parallel for, here there's no issues with the function.

So here the kernel arguments are whatever is captured by the [=] lambda, and the underlying order of the kernel arguments is whatever C++ usually does when capturing variables. Using [a, b] specifies manually what is captured and in which order, so it lets you re-order the underlying kernel arguments. This is not really SYCL specific, though, just C++ lambdas.

I'm not entirely sure how [=] orders captures, but it's possible that re-ordering the accessor declarations might also change the order and "fix" it.

In most cases the kernel argument order shouldn't matter, and we definitely need to fix this so users don't have to worry about it.
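
For reference, here is a tiny plain C++ example of the two capture styles (the variable names are made up for illustration, they are not the ones from the benchmark). An explicit capture list spells out exactly what is captured and in which order, which is what the workaround above relies on to control the kernel-argument layout:

#include <cstdio>

int main() {
  int lsum = 0;
  double lsums = 0.0; // stand-in for the wider double4 accessor

  // Implicit capture: the compiler decides what is captured and how the
  // closure stores it.
  auto implicit_capture = [=] { std::printf("%d %f\n", lsum, lsums); };

  // Explicit capture: each variable is named, with the wider one listed
  // first, mirroring the re-ordered capture list in the patch above.
  auto explicit_capture = [lsums, lsum] { std::printf("%d %f\n", lsum, lsums); };

  implicit_capture();
  explicit_capture();
  return 0;
}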

@zjin-lcf
Contributor Author

zjin-lcf commented Dec 2, 2021

Thank you for explaining the kernel argument order.

@AerialMantis AerialMantis moved this from Selected to In progress in oneAPI DPC++ Dec 7, 2021
npmiller added a commit to npmiller/llvm that referenced this issue Dec 8, 2021
This patch comes from an attempt to fix intel#5007.

The issue there is that for local kernel arguments the CUDA plugin uses
CUDA dynamic shared memory, which gives us a single chunk of shared
memory to work with.

The CUDA plugin then lays out all the local kernel arguments
consecutively in this single chunk of memory.

And this can cause issues because simply laying the arguments out one
after the other can result in misaligned arguments. In intel#5007, for example,
there is an `int` argument followed by a `double4` argument, so the
`double4` argument ends up with the wrong alignment, only being aligned
on a 4-byte boundary following the `int`.

It is possible to adjust this and fix up the alignment when laying out
the local kernel arguments in the CUDA plugin; however, before this patch
the only information available to the plugin was the total size of local
memory required for the given arguments, which doesn't tell us anything
about the required alignment.

So this patch propagates the size of the elements inside of the
local accessor all the way down to the PI plugin through
`piKernelSetArg`, and tweaks the local argument layout in the CUDA
plugin to use the type size as alignment for local kernel arguments.
npmiller added a commit to npmiller/llvm that referenced this issue Dec 9, 2021
The issue there is that for local kernel arguments the CUDA plugin uses
CUDA dynamic shared memory, which gives us a single chunk of shared
memory to work with.

The CUDA plugin then lays out all the local kernel arguments
consecutively in this single chunk of memory.

And this can cause issues because simply laying the arguments out one
after the other can result in misaligned arguments.

So this patch changes the argument layout to align them to the
maximum necessary alignment, which is the size of the largest vector
type. Additionally, if a local buffer is smaller than this maximum
alignment, the size of that buffer is simply used for alignment.

This fixes the issue in intel#5007.

See also the discussion on intel#5104 for an alternative solution that may be
more efficient but would require a more intrusive, ABI-changing patch.
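
A rough stand-alone sketch of that layout rule (illustrative only, not the actual plugin code; the 32-byte maximum is an assumed figure for the size of the largest vector type). Each local buffer is placed at an offset aligned to min(buffer size, maximum alignment):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Round offset up to the next multiple of align.
static std::size_t align_up(std::size_t offset, std::size_t align) {
  return (offset + align - 1) / align * align;
}

int main() {
  constexpr std::size_t max_align = 32; // assumed size of the largest vector type

  // Hypothetical local arguments: an int followed by a double4.
  std::vector<std::size_t> buffer_sizes = {sizeof(int), 4 * sizeof(double)};

  std::size_t offset = 0;
  for (std::size_t size : buffer_sizes) {
    // Buffers smaller than the maximum alignment are aligned to their own size.
    std::size_t align = std::min(size, max_align);
    offset = align_up(offset, align);
    std::printf("buffer of %zu bytes placed at offset %zu\n", size, offset);
    offset += size;
  }
  return 0;
}

With this rule the double4 from this issue ends up on a 32-byte boundary instead of at offset 4, regardless of the order in which the arguments are passed.
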
bader pushed a commit that referenced this issue Jan 10, 2022
@AerialMantis
Contributor

AerialMantis commented Jan 11, 2022

We believe the issue for CUDA to be addressed by #5113; there is a further pull request open for generalizing this across both the NVPTX and AMDGCN LLVM backends, resolving this for HIP as well: #5149.

Edit: I originally closed this issue and then re-opened it as it's not yet addressed for the HIP backend.

oneAPI DPC++ automation moved this from In progress to Closed Jan 11, 2022
@AerialMantis AerialMantis reopened this Jan 11, 2022
oneAPI DPC++ automation moved this from Closed to In progress Jan 11, 2022
@zjin-lcf
Contributor Author

Running the SYCL program on an NVIDIA GPU produces the right result after "ulimit -s unlimited".
However, running the program on an AMD GPU (gfx908) causes a segfault. Could you reproduce that? Thanks.

@npmiller
Contributor

I'm not having any issues with aop-sycl on gfx908 with the latest:

  • intel/llvm: 028055893dc16605d966592f97843173bc2221d7
  • HeCBench: 13c7d1d55bcad1575d584b6f1ed4c2a663677b5b

Build command:

make GPU=yes HIP=yes HIP_ARCH=gfx908

Run command:

SYCL_DEVICE_FILTER=hip ./main       
==============
Num Timesteps         : 100
Num Paths             : 32K
Num Runs              : 1
T                     : 1.000000
S0                    : 3.600000
K                     : 4.000000
r                     : 0.060000
sigma                 : 0.200000
Option Type           : American Put
==============
GPU Longstaff-Schwartz: 0.44783124
Binonmial             : 0.44880498
European Price        : 0.38443078
==============
elapsed time for each run         : 22.000ms
==============

@zjin-lcf
Contributor Author

Could you run ./main -runs 100?

Thanks

@npmiller
Contributor

Oh, that's interesting, I am getting a segfault with that:

SYCL_DEVICE_FILTER=hip ./main -runs 100
==============
Num Timesteps         : 100
Num Paths             : 32K
Num Runs              : 100
T                     : 1.000000
S0                    : 3.600000
K                     : 4.000000
r                     : 0.060000
sigma                 : 0.200000
Option Type           : American Put
[1]    40238 segmentation fault  SYCL_DEVICE_FILTER=hip ./main -runs 100
SYCL_DEVICE_FILTER=hip ./main -runs 100  19.99s user 0.21s system 103% cpu 19.450 total

@zjin-lcf
Contributor Author

I remember a segfault occurs on an NVIDIA P100 GPU without "ulimit -s unlimited".

@npmiller
Contributor

So looking a bit closer at this, I think there might be a bug and/or race condition in the SYCL runtime with regard to command deletion, which triggers the segfault. I'll have to dig into it further, but it seems like setting the environment variable SYCL_DISABLE_POST_ENQUEUE_CLEANUP works around the problem:

SYCL_DEVICE_FILTER=hip SYCL_DISABLE_POST_ENQUEUE_CLEANUP=1 ./main -runs 100
==============
Num Timesteps         : 100
Num Paths             : 32K
Num Runs              : 100
T                     : 1.000000
S0                    : 3.600000
K                     : 4.000000
r                     : 0.060000
sigma                 : 0.200000
Option Type           : American Put
==============
GPU Longstaff-Schwartz: 0.44747600
Binonmial             : 0.44880498
European Price        : 0.38443078
==============
elapsed time for each run         : 20.040ms
==============
SYCL_DEVICE_FILTER=hip SYCL_DISABLE_POST_ENQUEUE_CLEANUP=1 ./main -runs 100  37.71s user 0.27s system 104% cpu 36.477 total

With this it will likely use more memory, but at least it shouldn't segfault while trying to access a deleted command.

@bader
Contributor

bader commented Jan 24, 2022

@sergey-semenov, FYI.

@zjin-lcf
Contributor Author

Thank you for pointing out the issue.

@pvchupin
Contributor

pvchupin commented Feb 1, 2022

@sergey-semenov, can you please confirm that this is a general runtime issue rather than CUDA-specific?

@sergey-semenov
Contributor

It certainly appears so. There's one known segfault problem related to post-enqueue cleanup, and it has to do with execution graph leaf handling. @npmiller Could you please check if this fix (#5417) takes care of this segfault too? @KseniyaTikhomirova FYI

@npmiller
Contributor

npmiller commented Feb 1, 2022

@sergey-semenov I've just tried it, and as far as I can tell #5417 does seem to fix the segfault in this test. I've run the sample successfully several times with 100 iterations and a couple of times with 500 iterations with no issues.

@zjin-lcf I suspect this may also fix it for CUDA even without ulimit -s unlimited

@zjin-lcf
Contributor Author

zjin-lcf commented Feb 1, 2022

I would like to run the program after #5417 is merged. I hope that is fine. Thank you for your updates and tests.

@npmiller
Contributor

@zjin-lcf #5417 has been merged; have you had time to test this again? Can we close this ticket?

@zjin-lcf
Contributor Author

Yes. I ran the example. Thanks.

oneAPI DPC++ automation moved this from In progress to Closed Feb 15, 2022