You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
On Clang 8.0 + CUDA 10.0 (kokkos-dev2), two kokkos-kernels tests are failing when 16-byte alignment of Kokkos::complex is enabled. I think this is because of an issue in cuda_inter_warp_reduction, but I'm not positive. If true, the cause of kokkos/kokkos-kernels#645 and kokkos/kokkos-kernels#704.
For both of the kokkos-kernels tests with this failure (cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace and cuda.team_dot_complex_double), cuda-memcheck catches the misaligned address at Kokkos_Cuda_ReduceScan.hpp:292 in device code.
Example 1:
[bmkelle@kokkos-dev-2 unit_test]$ cuda-memcheck --show-backtrace host
./KokkosKernels_blas_cuda --gtest_filter=cuda.team_dot_complex_double
========= CUDA-MEMCHECK
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Note: Google Test filter = cuda.team_dot_complex_double
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN ] cuda.team_dot_complex_double
========= Invalid __shared__ write of size 16
========= at 0x00001010 in /ascldap/users/bmkelle/Fix703-testing-1588088203/Testing/TestAll_2020-04-28_08.36.50/clang/8.0/Cuda_OpenMP-release/kokkos-install/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:292:_ZN6Kokkos4ImplL33cuda_parallel_launch_local_memoryINS0_11ParallelForIZN4Test18impl_test_team_dotINS_4ViewIPNS_7complexIdEEJNS_10LayoutLeftENS_4CudaEEEESB_SA_EEviEUlRKNS0_14CudaTeamMemberEE_NS_10TeamPolicyIJSA_EEESA_EEEEvT_
========= by thread (0,0,0) in block (3,0,0)
========= Address 0x00000008 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x269e85]
========= Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 [0x1a4e0]
========= Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 (cudaLaunch + 0x143) [0x37fe3]
========= Host Frame:./KokkosKernels_blas_cuda [0xa7686]
========= Host Frame:./KokkosKernels_blas_cuda [0xa6b62]
========= Host Frame:./KokkosKernels_blas_cuda [0xa4818]
========= Host Frame:./KokkosKernels_blas_cuda [0x982d8]
========= Host Frame:./KokkosKernels_blas_cuda [0x32f2f4]
========= Host Frame:./KokkosKernels_blas_cuda [0x313170]
========= Host Frame:./KokkosKernels_blas_cuda [0x314060]
========= Host Frame:./KokkosKernels_blas_cuda [0x314797]
========= Host Frame:./KokkosKernels_blas_cuda [0x31d2b7]
========= Host Frame:./KokkosKernels_blas_cuda [0x32feb4]
========= Host Frame:./KokkosKernels_blas_cuda [0x31ce4b]
========= Host Frame:./KokkosKernels_blas_cuda [0x8150]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22545]
========= Host Frame:./KokkosKernels_blas_cuda [0x8069]
=========
Example 2:
[bmkelle@kokkos-dev-2 unit_test]$ cuda-memcheck --show-backtrace host ./KokkosKernels_sparse_cuda --gtest_filter=cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
========= CUDA-MEMCHECK
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Note: Google Test filter = cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN ] cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
========= Invalid __shared__ write of size 16
========= at 0x000010e0 in /ascldap/users/bmkelle/Fix703-testing-1588088203/Testing/TestAll_2020-04-28_08.36.50/clang/8.0/Cuda_OpenMP-release/kokkos-install/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:292:_ZN6Kokkos4ImplL33cuda_parallel_launch_local_memoryINS0_11ParallelForIN12KokkosSparse4Impl12Experimental27TriLvlSchedTP1SolverFunctorINS_4ViewIPKiJNS_10LayoutLeftENS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_12MemoryTraitsILj3EEEEEESH_NS7_IPKNS_7complexIdEEJSA_SE_SG_EEENS7_IPSJ_JSA_SE_NSF_ILj1EEEEEESM_NS7_IPiJSD_EEEEENS_10TeamPolicyIJSC_EEESC_EEEEvT_
========= by thread (0,32,0) in block (7,0,0)
========= Address 0x00000018 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x269e85]
========= Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 [0x1a4e0]
========= Host Frame:/ascldap/users/projects/x86-64/cuda/10.0/lib64/libcudart.so.10.0 (cudaLaunch + 0x143) [0x37fe3]
========= Host Frame:./KokkosKernels_sparse_cuda [0x853cb8]
========= Host Frame:./KokkosKernels_sparse_cuda [0x853292]
========= Host Frame:./KokkosKernels_sparse_cuda [0x852460]
========= Host Frame:./KokkosKernels_sparse_cuda [0x850b42]
========= Host Frame:./KokkosKernels_sparse_cuda [0x84f576]
========= Host Frame:./KokkosKernels_sparse_cuda [0x84be81]
========= Host Frame:./KokkosKernels_sparse_cuda [0x84aa34]
========= Host Frame:./KokkosKernels_sparse_cuda [0x104306]
========= Host Frame:./KokkosKernels_sparse_cuda [0xff3f8]
========= Host Frame:./KokkosKernels_sparse_cuda [0x951b3]
========= Host Frame:./KokkosKernels_sparse_cuda [0x8af774]
========= Host Frame:./KokkosKernels_sparse_cuda [0x893a90]
========= Host Frame:./KokkosKernels_sparse_cuda [0x894980]
========= Host Frame:./KokkosKernels_sparse_cuda [0x8950b7]
========= Host Frame:./KokkosKernels_sparse_cuda [0x89dbd7]
========= Host Frame:./KokkosKernels_sparse_cuda [0x8b0334]
========= Host Frame:./KokkosKernels_sparse_cuda [0x89d76b]
========= Host Frame:./KokkosKernels_sparse_cuda [0x8a80]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22545]
========= Host Frame:./KokkosKernels_sparse_cuda [0x8999]
=========
Looking at that function:
273 template <class ReducerType>
274 __device__ inline
275 typename std::enable_if<Kokkos::is_reducer<ReducerType>::value>::type
276 cuda_inter_warp_reduction(const ReducerType& reducer,
277 typename ReducerType::value_type value,
278 const int max_active_thread = blockDim.y) {
279 typedef typename ReducerType::value_type ValueType;
280
281 #define STEP_WIDTH 4
282 // Depending on the ValueType _shared__ memory must be aligned up to 8byte
283 // boundaries The reason not to use ValueType directly is that for types with
284 // constructors it could lead to race conditions
HERE: declaring __shared__ double array and then casting that to ValueType array.
ValueType in this case is complex<double> and requires 16-byte alignment.
285 __shared__ double sh_result[(sizeof(ValueType) + 7) / 8 * STEP_WIDTH];
286 ValueType* result = (ValueType*)&sh_result;
287 const int step = 32 / blockDim.x;
288 int shift = STEP_WIDTH;
289 const int id = threadIdx.y % step == 0 ? threadIdx.y / step : 65000;
290 if (id < STEP_WIDTH) {
291 result[id] = value;
292 }
293 __syncthreads();
294 while (shift <= max_active_thread / step) {
295 if (shift <= id && shift + STEP_WIDTH > id && threadIdx.x == 0) {
296 reducer.join(result[id % STEP_WIDTH], value);
297 }
298 __syncthreads();
299 shift += STEP_WIDTH;
300 }
301
302 value = result[0];
303 for (int i = 1; (i * step < max_active_thread) && i < STEP_WIDTH; i++)
304 reducer.join(value, result[i]);
305
306 reducer.reference() = value;
307 }
The only thing I don't understand is why this doesn't happen on GCC + CUDA builds. Maybe we just get lucky on the layout of shared memory? But a reduction using complex is a very common thing throughout kokkos-kernels.
Full reproducer instructions for kokkos-dev2:
module purge
module use /home/projects/x86-64/modulefiles/local
module load sems-env sems-cmake/3.12.2 kokkos-env clang/8.0 cuda/10.0
git clone git@github.com:kokkos/kokkos
cd kokkos
git checkout develop
cd ..
git clone git@github.com:kokkos/kokkos-kernels
cd kokkos-kernels
git checkout develop
cd ..
mkdir build
cd build
ln -s ../kokkos
../kokkos-kernels/cm_generate_makefile.bash --with-devices=Cuda,OpenMP --debug --arch=SNB,Volta70 --compiler=/home/projects/x86-64/clang/8.0/bin/clang++ --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized -Wno-pass-failed " --cxxstandard="11" --ldflags="" --with-cuda=/home/projects/x86-64/cuda/10.0 --with-scalars='double,complex_double'
make -j20
cd unit_test
cuda-memcheck ./KokkosKernels_blas_cuda --gtest_filter=cuda.team_dot_complex_double
cuda-memcheck ./KokkosKernels_sparse_cuda --gtest_filter=cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace
The text was updated successfully, but these errors were encountered:
crtrott
added
the
Bug
Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
label
Apr 29, 2020
On Clang 8.0 + CUDA 10.0 (kokkos-dev2), two kokkos-kernels tests are failing when 16-byte alignment of Kokkos::complex is enabled. I think this is because of an issue in
cuda_inter_warp_reduction
, but I'm not positive. If true, the cause of kokkos/kokkos-kernels#645 and kokkos/kokkos-kernels#704.For both of the kokkos-kernels tests with this failure (cuda.sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace and cuda.team_dot_complex_double), cuda-memcheck catches the misaligned address at Kokkos_Cuda_ReduceScan.hpp:292 in device code.
Example 1:
Example 2:
Looking at that function:
The only thing I don't understand is why this doesn't happen on GCC + CUDA builds. Maybe we just get lucky on the layout of shared memory? But a reduction using complex is a very common thing throughout kokkos-kernels.
Full reproducer instructions for kokkos-dev2:
The text was updated successfully, but these errors were encountered: