Nightly cuda/12.0, cuda/11.8 unit test failures #1663
Comments
@lucbv: Do you have any notes on this so I can pick up from where you left off, or do you want to pair up?
Notes:
|
Relevant snippet from memcheck:
Note that all inverselu invalid reads come from the Blocked algo type. |
Note: Cuda/12 wants all addresses 16-byte aligned but, in the BatchedSerialGemm Blocked implementation, we dereference an address that is only 8-byte aligned. TODO: Print out pointer scalar types and their sizes as well as the starting addresses of views/subviews.
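A minimal host-side sketch of the kind of check the TODO above describes, in plain C++ with no Kokkos dependency. The helper names `is_aligned` and `report_alignment` are hypothetical and are not part of Kokkos or KokkosKernels; the 16-byte requirement is taken from the note above rather than verified here. In the real test the pointer would presumably come from a view's or subview's `data()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: true when `p` is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: prints the scalar size, the starting address, and
// whether that address satisfies the requested alignment (16 bytes by default).
template <class T>
bool report_alignment(const char* label, const T* p, std::size_t alignment = 16) {
  const bool ok = is_aligned(p, alignment);
  std::printf("%s: sizeof(scalar)=%zu addr=%p %zu-byte aligned=%s\n", label,
              sizeof(T), static_cast<const void*>(p), alignment,
              ok ? "yes" : "no");
  return ok;
}
```

For example, with `alignas(16) double buf[8]`, `report_alignment("buf", buf)` reports an aligned start, while `report_alignment("buf + 1", buf + 1)` reports an address that is 8-byte but not 16-byte aligned, which is the situation described for the Blocked implementation.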
After more debugging I have determined that the misalignment is stemming from |
Given that the functor in question does not use any addresses that are violating 16-byte alignment nor do locals ( |
Here are more triaging results. Note that local memory can only be allocated by the compiler.
This change resulted in passing tests in cuda/12.0. |
The (register allocation bug?) still persists in cuda/12.2.
KokkosKernels HEAD SHA: 6c06bd0
Local changes in KokkosKernels: kk_local_changes.txt
Local change in Kokkos: none.
NOTE: You have to comment out the following prints in the operator to trigger misalignment:
|
Hello, I am looking into this bug and came across something I found strange. If you keep all the source for the test the same but take out one Kokkos::abort, the test no longer seems to hit this error message. Does anyone have an idea why that would be? Change the abort here to just return 0; or comment it out entirely.
to
And on my machine I get no error. Because of the lack of abort, am I just missing a cudaCheckLastError call or something like that? I can't tell yet whether the Kokkos::abort is itself the issue, whether removing it makes me miss the trigger for the bug, or whether the Cuda error is simply not being printed. Though when I searched through the src for cuda_abort, it looks like it just prints the message you give it. @crtrott for vis
Just to update: these two tests also fail with cd8f77c when complex_double types are enabled in builds that have c++20 enabled, for example with cuda/12.0.0 + gcc/11.3.0.
If I configure with the option |
The same tests fail with cuda/11.8.0 when testing with cusparse and magma tpls enabled |
Updating the issue to confirm the same tests still fail with cuda/11.8.0, cuda/12.0 +/- c++20 on Weaver (Volta70+Power9) with SHA 32aa75a.
Configuration (Weaver, cuda/12.0 w/ c++20):
bsub -Is -n 1 -q rhel8 -gpu "num=1" bash
source /etc/profile.d/modules.sh
module load cmake git gcc/11.3.0 cuda/12.0.0
${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --cxxstandard=20
Test failures:
|
The tests above passed on kokkos-dev-2 with sems-cuda/12.4 + sems-gcc/13.2.0 |
@ndellingwood so with cuda 12.4 we have the |
@lucbv on kokkos-dev-2 the configuration here (with Power9 dropped), using sems-cuda/12.4, the tests passed 100% |
Sub-tests are failing in cuda/12.0 builds with the batched_dla_cuda and batched_gemm_cuda unit tests, with the error message:
cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address
batched_dla_cuda
batched_gemm_cuda
Reproducer (kokkos-dev-2):