Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

complex_double misalignment in reduce, clang+CUDA #2989

Closed
brian-kelley opened this issue Apr 29, 2020 · 0 comments
Closed

complex_double misalignment in reduce, clang+CUDA #2989

brian-kelley opened this issue Apr 29, 2020 · 0 comments
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)

Comments

@brian-kelley
Copy link
Contributor

On Clang 8.0 + CUDA 10.0 (kokkos-dev2), two kokkos-kernels tests are failing when 16-byte alignment of Kokkos::complex is enabled. I think this is because of an issue in cuda_inter_warp_reduction, but I'm not positive. If true, the cause of kokkos/kokkos-kernels#645 and kokkos/kokkos-kernels#704.

For both of the kokkos-kernels tests with this failure (cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace and cuda.team_dot_complex_double), cuda-memcheck catches the misaligned address at Kokkos_Cuda_ReduceScan.hpp:292 in device code.

Example 1:

[bmkelle@kokkos-dev-2 unit_test]$ cuda-memcheck --show-backtrace host 

./KokkosKernels_blas_cuda --gtest_filter=cuda.team_dot_complex_double
========= CUDA-MEMCHECK
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Note: Google Test filter = cuda.team_dot_complex_double
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN      ] cuda.team_dot_complex_double
========= Invalid __shared__ write of size 16
=========     at 0x00001010 in /ascldap/users/bmkelle/Fix703-testing-1588088203/Testing/TestAll_2020-04-28_08.36.50/clang/8.0/Cuda_OpenMP-release/kokkos-install/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:292:_ZN6Kokkos4ImplL33cuda_parallel_launch_local_memoryINS0_11ParallelForIZN4Test18impl_test_team_dotINS_4ViewIPNS_7complexIdEEJNS_10LayoutLeftENS_4CudaEEEESB_SA_EEviEUlRKNS0_14CudaTeamMemberEE_NS_10TeamPolicyIJSA_EEESA_EEEEvT_
=========     by thread (0,0,0) in block (3,0,0)
=========     Address 0x00000008 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x269e85]
=========     Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 [0x1a4e0]
=========     Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 (cudaLaunch + 0x143) [0x37fe3]
=========     Host Frame:./KokkosKernels_blas_cuda [0xa7686]
=========     Host Frame:./KokkosKernels_blas_cuda [0xa6b62]
=========     Host Frame:./KokkosKernels_blas_cuda [0xa4818]
=========     Host Frame:./KokkosKernels_blas_cuda [0x982d8]
=========     Host Frame:./KokkosKernels_blas_cuda [0x32f2f4]
=========     Host Frame:./KokkosKernels_blas_cuda [0x313170]
=========     Host Frame:./KokkosKernels_blas_cuda [0x314060]
=========     Host Frame:./KokkosKernels_blas_cuda [0x314797]
=========     Host Frame:./KokkosKernels_blas_cuda [0x31d2b7]
=========     Host Frame:./KokkosKernels_blas_cuda [0x32feb4]
=========     Host Frame:./KokkosKernels_blas_cuda [0x31ce4b]
=========     Host Frame:./KokkosKernels_blas_cuda [0x8150]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22545]
=========     Host Frame:./KokkosKernels_blas_cuda [0x8069]
=========

Example 2:

[bmkelle@kokkos-dev-2 unit_test]$ cuda-memcheck --show-backtrace host ./KokkosKernels_sparse_cuda --gtest_filter=cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
========= CUDA-MEMCHECK
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Note: Google Test filter = cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN      ] cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
========= Invalid __shared__ write of size 16
=========     at 0x000010e0 in /ascldap/users/bmkelle/Fix703-testing-1588088203/Testing/TestAll_2020-04-28_08.36.50/clang/8.0/Cuda_OpenMP-release/kokkos-install/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:292:_ZN6Kokkos4ImplL33cuda_parallel_launch_local_memoryINS0_11ParallelForIN12KokkosSparse4Impl12Experimental27TriLvlSchedTP1SolverFunctorINS_4ViewIPKiJNS_10LayoutLeftENS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_12MemoryTraitsILj3EEEEEESH_NS7_IPKNS_7complexIdEEJSA_SE_SG_EEENS7_IPSJ_JSA_SE_NSF_ILj1EEEEEESM_NS7_IPiJSD_EEEEENS_10TeamPolicyIJSC_EEESC_EEEEvT_
=========     by thread (0,32,0) in block (7,0,0)
=========     Address 0x00000018 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x269e85]
=========     Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 [0x1a4e0]
=========     Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 (cudaLaunch + 0x143) [0x37fe3]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x853cb8]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x853292]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x852460]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x850b42]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x84f576]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x84be81]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x84aa34]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x104306]
=========     Host Frame:./KokkosKernels_sparse_cuda [0xff3f8]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x951b3]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x8af774]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x893a90]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x894980]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x8950b7]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x89dbd7]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x8b0334]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x89d76b]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x8a80]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22545]
=========     Host Frame:./KokkosKernels_sparse_cuda [0x8999]
=========

Looking at that function:

273 template <class ReducerType>
274 __device__ inline
275     typename std::enable_if<Kokkos::is_reducer<ReducerType>::value>::type
276     cuda_inter_warp_reduction(const ReducerType& reducer,
277                               typename ReducerType::value_type value,
278                               const int max_active_thread = blockDim.y) {
279   typedef typename ReducerType::value_type ValueType;
280 
281 #define STEP_WIDTH 4
282   // Depending on the ValueType _shared__ memory must be aligned up to 8byte
283   // boundaries The reason not to use ValueType directly is that for types with
284   // constructors it could lead to race conditions

HERE: declaring __shared__ double array and then casting that to ValueType array.
ValueType in this case is complex<double> and requires 16-byte alignment.

285   __shared__ double sh_result[(sizeof(ValueType) + 7) / 8 * STEP_WIDTH];
286   ValueType* result = (ValueType*)&sh_result;
287   const int step    = 32 / blockDim.x;
288   int shift         = STEP_WIDTH;
289   const int id      = threadIdx.y % step == 0 ? threadIdx.y / step : 65000;
290   if (id < STEP_WIDTH) {
291     result[id] = value;
292   }
293   __syncthreads();
294   while (shift <= max_active_thread / step) {
295     if (shift <= id && shift + STEP_WIDTH > id && threadIdx.x == 0) {
296       reducer.join(result[id % STEP_WIDTH], value);
297     }
298     __syncthreads();
299     shift += STEP_WIDTH;
300   }
301 
302   value = result[0];
303   for (int i = 1; (i * step < max_active_thread) && i < STEP_WIDTH; i++)
304     reducer.join(value, result[i]);
305 
306   reducer.reference() = value;
307 }

The only thing I don't understand is why this doesn't happen on GCC + CUDA builds. Maybe we just get lucky on the layout of shared memory? But a reduction using complex is a very common thing throughout kokkos-kernels.

Full reproducer instructions for kokkos-dev2:

module purge
module use /home/projects/x86-64/modulefiles/local
module load sems-env sems-cmake/3.12.2 kokkos-env clang/8.0 cuda/10.0

git clone git@github.com:kokkos/kokkos
cd kokkos
git checkout develop
cd ..
git clone git@github.com:kokkos/kokkos-kernels
cd kokkos-kernels
git checkout develop
cd ..
mkdir build
cd build
ln -s ../kokkos
../kokkos-kernels/cm_generate_makefile.bash --with-devices=Cuda,OpenMP --debug --arch=SNB,Volta70 --compiler=/home/projects/x86-64/clang/8.0/bin/clang++ --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized -Wno-pass-failed " --cxxstandard="11" --ldflags="" --with-cuda=/home/projects/x86-64/cuda/10.0 --with-scalars='double,complex_double'
make -j20
cd unit_test
cuda-memcheck ./KokkosKernels_blas_cuda --gtest_filter=cuda.team_dot_complex_double
cuda-memcheck ./KokkosKernels_sparse_cuda --gtest_filter=cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
@crtrott crtrott added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Apr 29, 2020
dhollman pushed a commit to dhollman/kokkos that referenced this issue May 4, 2020
Also probably fixes kokkos/kokkos-kernels#645 and kokkos/kokkos-kernels#704.

The reproducer in kokkos#2989 now passes with no errors.
crtrott pushed a commit to crtrott/kokkos that referenced this issue May 5, 2020
Also probably fixes kokkos/kokkos-kernels#645 and kokkos/kokkos-kernels#704.

The reproducer in kokkos#2989 now passes with no errors.
crtrott added a commit that referenced this issue May 5, 2020
…lignment

Address issue #2989: complex double misalignment
@crtrott crtrott closed this as completed May 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Projects
None yet
Development

No branches or pull requests

2 participants