Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nightly cuda/12.0, cuda/11.8 unit test failures #1663

Open
ndellingwood opened this issue Feb 1, 2023 · 16 comments
Open

Nightly cuda/12.0, cuda/11.8 unit test failures #1663

ndellingwood opened this issue Feb 1, 2023 · 16 comments

Comments

@ndellingwood
Copy link
Contributor

Sub-tests are failing in cuda/12.0 builds with the batched_dla_cuda and batched_gemm_cuda unit tests with error message cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address

batched_dla_cuda

00:37:42 3: [ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
00:37:42 3: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 3: Backtrace:
00:37:42 3:                         [0x675133]
00:37:42 3:                         [0x66d398]
00:37:42 3:                         [0x66d3cb]
00:37:42 3:                         [0x67b0fd]
00:37:42 3:                         [0x67bc91]
00:37:42 3:                         [0x651a65]
00:37:42 3:                         [0x5fd33a]
00:37:42 3:                         [0x43c787]
00:37:42 3:                         [0x64f34d]
00:37:42 3:                         [0x6432f3]
00:37:42 3:                         [0x6437a5]
00:37:42 3:                         [0x64398e]
00:37:42 3:                         [0x649a56]
00:37:42 3:                         [0x649cdb]
00:37:42 3:                         [0x413232]
00:37:42 3: __libc_start_main [0x7f7d99a18555]
00:37:42 3:                         [0x41acad]

batched_gemm_cuda

00:37:42 4: [ RUN      ] Cuda.batched_scalar_serial_gemm_nt_nt_dcomplex_dcomplex
00:37:42 4: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 4: Backtrace:
00:37:42 4:                         [0x7c22d3]
00:37:42 4:                         [0x7ba538]
00:37:42 4:                         [0x7ba56b]
00:37:42 4:                         [0x7c829d]
00:37:42 4:                         [0x7c8e31]
00:37:42 4:                         [0x79eb35]
00:37:42 4:                         [0x45e0b1]
00:37:42 4:                         [0x6fda73]
00:37:42 4:                         [0x6fe1d6]
00:37:42 4:                         [0x79c41d]
00:37:42 4:                         [0x7903c3]
00:37:42 4:                         [0x790875]
00:37:42 4:                         [0x790a5e]
00:37:42 4:                         [0x796b26]
00:37:42 4:                         [0x796dab]
00:37:42 4:                         [0x40e032]
00:37:42 4: __libc_start_main [0x7fb3be79b555]
00:37:42 4:                         [0x4170dd]

Reproducer (kokkos-dev-2):

source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh ; module use /home/projects/x86-64/modulefiles/local
module purge
module load sems-archive-env sems-archive-cmake/3.17.1 sems-archive-gcc/9.2.0 cuda/12.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="17" --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda   --no-examples
@e10harvey e10harvey added this to To do in Developer: E10HARVEY via automation Aug 10, 2023
@e10harvey
Copy link
Contributor

@lucbv: Do you have any notes on this so I can pickup from where you left off or do you want to pair up?

@e10harvey e10harvey moved this from To do to In progress in Developer: E10HARVEY Aug 10, 2023
@e10harvey
Copy link
Contributor

Notes:

@e10harvey
Copy link
Contributor

Relevant snippet from memcheck:

========= Invalid __local__ read of size 16 bytes
=========     at 0xdeadbeef in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Test::SerialInverseLU::Functor_BatchedSerialGemm<Kokkos::Cuda, Kokkos::View<Kokkos::complex<double> ***, Kokkos::LayoutLeft, Kokkos::Cuda>, Kokkos::complex<double>, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>, KokkosBlas::Algo::Level3::Blocked>, Kokkos::RangePolicy<Kokkos::Cuda, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>>, Kokkos::Cuda>>(T1

Note that all inverselu invalid reads come from the Blocked algo type.

@e10harvey
Copy link
Contributor

e10harvey commented Aug 23, 2023

Note: Cuda/12 wants all addresses 16-byte aligned but, in the BatchedSerialGemm Blocked implementation, we de-reference a address that is 8-byte aligned.

TODO: Print out pointer scalar types and their size as well as the starting addresses of views/subviews.

@e10harvey
Copy link
Contributor

e10harvey commented Aug 28, 2023

After more debugging I have determined that the misalignment is stemming from Functor_BatchedSerialGemm in Test_Batched_SerialInverseLU.hpp of an address outside the control of the parallel_for caller.

@e10harvey
Copy link
Contributor

Given that the functor in question does not use any addresses that are violating 16-byte alignment nor do locals (&_alpha or &_beta) violate 16-byte alignment, I believe this is either a Kokkos Core or a compiler bug. Regardless of where the bug stems from, we should ask someone from Cuda or Kokkos Core to investigate further.

@e10harvey
Copy link
Contributor

Here are more triaging results. Note that local memory can only be allocated by the compiler.

  1. Christian and I tried moving Scalar _alpha, _beta above the declaration of the _a, _b, _c locals in the functor class definition:
template <typename DeviceType, typename ViewType, typename ScalarType,
          typename ParamTagType, typename AlgoTagType>
struct Functor_BatchedSerialGemm {
  ScalarType _alpha, _beta;
  ViewType _a, _b, _c;

This change resulted in passing tests in cuda/12.0.

@e10harvey
Copy link
Contributor

The (register allocation bug?) still persists in cuda/12.2.

KokkosKernels HEAD SHA: 6c06bd0
Kokkos HEAD SHA: 7e299b4e25c42528e105379c3aa9a318056545ba

Local changes in KokkosKernels: kk_local_changes.txt

Local change in Kokkos: none.

module load sems-archive-env sems-archive-cmake/3.17.1 gcc/11 nvhpc/23.7
make -j16 KokkosKernels_batched_dla_cuda
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
mark2
i:0
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a0e23] 
[0x69b148] 
[0x69b17b] 
[0x6a6ee7] 
[0x6a786d] 
[0x67f4a5] 
[0x65961a] 
[0x47829d] 
[0x67cf8d] 
[0x672b98] 
[0x67332d] 
[0x673544] 
[0x6773d4] 
[0x672316] 
[0x4193c2] 
[0x7fd63cc7d555] __libc_start_main
[0x420b6d] 
Aborted (core dumped)
  1. Here is some additional sizeof and aligof information using cuda/12.2:
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
mark2
i:0
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:1
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df36680
c0:0x7fc68df3ea80
w:0x7fc68df3a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df42680
c0:0x7fc68df62a80
w:0x7fc68df52880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:3
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df56680
c0:0x7fc68df9ea80
w:0x7fc68df7a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:4
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df72680
c0:0x7fc67d400080
w:0x7fc68dfb2880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:5
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df96680
c0:0x7fc67d464280
w:0x7fc67d400080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:6
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67d520480
w:0x7fc67d490280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:7
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67dc00080
w:0x7fc67d4c4280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:8
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:9
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
mark2
i:0
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:1
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df36680
c0:0x7fc68df3ea80
w:0x7fc68df3a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df42680
c0:0x7fc68df62a80
w:0x7fc68df52880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:3
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df56680
c0:0x7fc68df9ea80
w:0x7fc68df7a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:4
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df72680
c0:0x7fc67d400080
w:0x7fc68dfb2880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:5
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df96680
c0:0x7fc67d464280
w:0x7fc67d400080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:6
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67d520480
w:0x7fc67d490280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:7
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67dc00080
w:0x7fc67d4c4280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:8
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:9
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
[       OK ] Cuda.batched_scalar_serial_inverselu_dcomplex (116 ms)
[----------] 1 test from Cuda (116 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (116 ms total)
[  PASSED  ] 1 test.

NOTE: You have to comment out the following prints in the operator to trigger misalignment:

  KOKKOS_INLINE_FUNCTION
  void operator()(const ParamTagType &, const int k) const {
    auto aa = Kokkos::subview(_a, k, Kokkos::ALL(), Kokkos::ALL());
    auto bb = Kokkos::subview(_b, k, Kokkos::ALL(), Kokkos::ALL());
    auto cc = Kokkos::subview(_c, k, Kokkos::ALL(), Kokkos::ALL());

    /* if (k == 0) {
      printf("In Operator: alignof(decltype(*this)):%lu\n", alignof(decltype(*this)));
      printf("In Operator: alignof(decltype(ViewType)):%lu\n", alignof(ViewType));
      printf("In Operator: alignof(decltype(ScalarType)):%lu\n", alignof(ScalarType));
    } */
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7ff6a7f32480
a1:0x7ff6a7f32680
c0:0x7ff6a7f32a80
w:0x7ff6a7f32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffd8a1ea7b0
&_beta:0x7ffd8a1ea7c0
mark2
i:0
mark0
mark1
a0:0x7ff6a7f32480
a1:0x7ff6a7f32680
c0:0x7ff6a7f32a80
w:0x7ff6a7f32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffd8a1ea7b0
&_beta:0x7ffd8a1ea7c0
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a11b3] 
[0x69b4d8] 
[0x69b50b] 
[0x6a7277] 
[0x6a7bfd] 
[0x67f835] 
[0x65996e] 
[0x47828d] 
[0x67d31d] 
[0x672f28] 
[0x6736bd] 
[0x6738d4] 
[0x677764] 
[0x6726a6] 
[0x419382] 
[0x7ff6d284e555] __libc_start_main
[0x420b2d] 
Aborted (core dumped)

@matt-stack
Copy link

Hello, I am looking into this bug, and came across something I found strange. If you keep all the source for the test the same, but take out one Kokkos::abort, then it seems to not hit this error message. Does anyone have an idea why that would be?

change the abort here to just return 0; or comment it out entirely.

  if (!(m <= 2 && n <= 2))
   Kokkos::abort(
        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

to

  if (!(m <= 2 && n <= 2)) return 0;
//    Kokkos::abort(
//        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

And on my machine I get no error. Because of the lack of abort, am I just missing a cudaCheckLastError call or something like that? I cant tell yet if the Kokkos::abort is an issue here, or its causing me to miss the trigger for the bug, or its not printing the Cuda error. Though when I searched through the src for cuda_abort, it looks like it just prints the message you give it. @crtrott for vis

@cwpearson cwpearson assigned cwpearson and unassigned e10harvey Oct 18, 2023
@ndellingwood
Copy link
Contributor Author

Just to update, these two tests fail with cd8f77c when enabling complex_double types in builds with c++20 enabled as well using for example cuda/12.0.0 + gcc/11.3.0

@ndellingwood
Copy link
Contributor Author

If I configure with the option -DKokkos_ENABLE_COMPLEX_ALIGN=OFF then the tests posted above pass
Adding @crtrott @dalg24 @masterleinad to the loop

@ndellingwood ndellingwood changed the title Nightly cuda/12.0 unit test failures Nightly cuda/12.0, cuda/11.8 unit test failures Apr 13, 2024
@ndellingwood
Copy link
Contributor Author

The same tests fail with cuda/11.8.0 when testing with cusparse and magma tpls enabled

@ndellingwood
Copy link
Contributor Author

Updating the issue to confirm the same tests still fail with cuda/11.8.0, cuda/12.0 +/- c++20 on Weaver (Volta70+Power9) with SHA 32aa75a

Configuration (Weaver, cuda/12.0 w/ c++20):

bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

source /etc/profile.d/modules.sh
module load cmake git gcc/11.3.0 cuda/12.0.0

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --cxxstandard=20

Test failures:

16:17:09 The following tests FAILED:
16:17:09 	  3 - batched_dla_cuda (Subprocess aborted)
16:17:09 	  4 - batched_gemm_cuda (Subprocess aborted)

@ndellingwood
Copy link
Contributor Author

The tests above passed on kokkos-dev-2 with sems-cuda/12.4 + sems-gcc/13.2.0

@lucbv
Copy link
Contributor

lucbv commented Jun 4, 2024

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

@ndellingwood
Copy link
Contributor Author

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

@lucbv on kokkos-dev-2 the configuration here (with Power9 dropped), using sems-cuda/12.4, the tests passed 100%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Developer: E10HARVEY
  
In progress
Development

No branches or pull requests

5 participants