seg fault Kokkos::Impl::CudaInternal::print_configuration #338

Closed
pkestene opened this issue Jun 28, 2016 · 5 comments

@pkestene
Contributor

The following may not be an issue with Kokkos itself, but here is the report anyway:

I have access to a large Supermicro server with 14 GPUs (7 K80 boards).

deviceQuery from the CUDA SDK (and other multi-GPU CUDA applications) runs fine, but I noticed that the Kokkos example named "query_device" gives a strange seg fault on this system. I added an output line right before the call to cudaGetDeviceProperties in core/src/Cuda/Kokkos_Cuda_Impl.cpp (in the CudaInternalDevices constructor):

    for ( int i = 0 ; i < m_cudaDevCount ; ++i ) {
      printf("CudaInternalDevices: %d out of %d\n",i,m_cudaDevCount);
      CUDA_SAFE_CALL( cudaGetDeviceProperties( m_cudaProp + i , i ) );
    }

The output of query_device on this system (CentOS 6.7, CUDA 7.5, GCC 4.9.3):
CudaInternalDevices: 0 out of 14
CudaInternalDevices: 1 out of 14
CudaInternalDevices: 2 out of 14
CudaInternalDevices: 3 out of 14
CudaInternalDevices: 4 out of 14
CudaInternalDevices: 5 out of 14
CudaInternalDevices: 6 out of 14
CudaInternalDevices: 7 out of 14
CudaInternalDevices: 8 out of 14
CudaInternalDevices: 9 out of 1819501908
CudaInternalDevices: 10 out of 1819501908
CudaInternalDevices: 11 out of 1819501908
CudaInternalDevices: 12 out of 1819501908
CudaInternalDevices: 13 out of 1819501908

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse42 () at ../sysdeps/x86_64/multiarch/strlen-sse4.S:48

It looks as if the variable m_cudaDevCount was the victim of a buffer overwrite.
I built Kokkos with and without the debug flag, with the same behaviour. I tried to find a memory leak but could not find one (in cudart?).

I tried rebooting the system (maybe a GPU driver issue), but the seg fault is still there after the reboot.

Finally, I don't really know what triggers this seg fault. Any ideas?

@crtrott
Member

crtrott commented Jun 28, 2016

A stupid internal hardcoded max of 8 devices.
Look in core/src/Cuda/Kokkos_Cuda_Impl.cpp:194.
Can you change that to 64 and retest?
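
For illustration, here is a minimal sketch of one way to avoid a compile-time cap altogether: size the property table from the reported device count instead of a fixed array. This is not the Kokkos code or the fix that was committed. `FakeDeviceProp`, `fake_get_device_properties`, and `DeviceTable` are hypothetical stand-ins so the example compiles without CUDA. As a side note, the garbage count in the report, 1819501908, is 0x6C736554, which is the ASCII bytes "Tesl" in little-endian order. That would be consistent with the name field of a ninth property struct (presumably starting with "Tesla") landing on the count, assuming the count is stored right after an 8-entry array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cudaDeviceProp; only one field is modelled.
struct FakeDeviceProp {
  int device_id = -1;
};

// Hypothetical stand-in for cudaGetDeviceProperties(prop, i).
inline void fake_get_device_properties(FakeDeviceProp* prop, int i) {
  prop->device_id = i;
}

// Device table whose storage is sized from the device count at run time,
// so no write can land past the end of a fixed-size array.
class DeviceTable {
 public:
  explicit DeviceTable(int device_count) {
    assert(device_count >= 0);
    m_props.resize(static_cast<std::size_t>(device_count));
    for (int i = 0; i < device_count; ++i) {
      fake_get_device_properties(&m_props[static_cast<std::size_t>(i)], i);
    }
  }

  int count() const { return static_cast<int>(m_props.size()); }

  // at() throws std::out_of_range on a bad index instead of corrupting memory.
  const FakeDeviceProp& prop(int i) const {
    return m_props.at(static_cast<std::size_t>(i));
  }

 private:
  std::vector<FakeDeviceProp> m_props;
};
```

With this layout, constructing `DeviceTable table(14);` fills 14 entries and `table.count()` stays 14. Raising the fixed limit, as suggested above, is the smaller change; the sketch only shows the alternative of removing the limit.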

@crtrott crtrott added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Jun 28, 2016
@crtrott crtrott added this to the Summer 2016 milestone Jun 28, 2016
@crtrott crtrott self-assigned this Jun 28, 2016
@crtrott
Member

crtrott commented Jun 28, 2016

I labeled this as a Bug. If changing that number fixes it, I will include the change in my next push to develop.

@pkestene
Contributor Author

Thanks a lot. It was right in front of me and I missed it.

@crtrott
Member

crtrott commented Jun 29, 2016

Did it work with the change?

@pkestene
Contributor Author

Yes, yes. Thanks.

crtrott added a commit that referenced this issue Jul 1, 2016
Hard coded limit on number of GPUs per node was too small.
hcedwar pushed a commit to hcedwar/kokkos that referenced this issue Jul 11, 2016
Hard coded limit on number of GPUs per node was too small.