seg fault Kokkos::Impl::CudaInternal::print_configuration #338
The following may not be an issue with Kokkos itself, but here is the report anyway.
I have access to a large server (Supermicro) with 14 GPUs (7 K80 boards).
deviceQuery from the CUDA SDK (and other multi-GPU CUDA apps) runs fine, but I noticed that
the Kokkos example named "query_device" gives a strange seg fault on this system. I added an output line right before the call to cudaGetDeviceProperties in core/src/Cuda/Kokkos_Cuda_Impl.cpp (in the CudaInternalDevices constructor):
for ( int i = 0 ; i < m_cudaDevCount ; ++i ) {
  printf("CudaInternalDevices: %d out of %d\n",i,m_cudaDevCount);
  CUDA_SAFE_CALL( cudaGetDeviceProperties( m_cudaProp + i , i ) );
}
The output of query_device on this system (CentOS 6.7, CUDA 7.5, GCC 4.9.3):
CudaInternalDevices: 0 out of 14
CudaInternalDevices: 1 out of 14
CudaInternalDevices: 2 out of 14
CudaInternalDevices: 3 out of 14
CudaInternalDevices: 4 out of 14
CudaInternalDevices: 5 out of 14
CudaInternalDevices: 6 out of 14
CudaInternalDevices: 7 out of 14
CudaInternalDevices: 8 out of 14
CudaInternalDevices: 9 out of 1819501908
CudaInternalDevices: 10 out of 1819501908
CudaInternalDevices: 11 out of 1819501908
CudaInternalDevices: 12 out of 1819501908
CudaInternalDevices: 13 out of 1819501908
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse42 () at ../sysdeps/x86_64/multiarch/strlen-sse4.S:48
It looks as if the variable m_cudaDevCount was the victim of a buffer overwrite.
I built Kokkos with and without the debug flag and got the same behaviour. I tried to find a memory leak (in cudart?) but could not find one.
I tried rebooting the system (maybe a GPU driver issue), but the seg fault is still there after the reboot.
Finally, I don't really know what triggers this seg fault. Any ideas?