Strange link error on ubuntu 22.04 #191

Closed
pkestene opened this issue Apr 8, 2023 · 6 comments

@pkestene
Contributor

pkestene commented Apr 8, 2023

Hello,

When trying to build owl on Ubuntu 22.04, I noticed an error message at link time that I'm not able to fix.

The link error only happens when building with nvcc (from the CUDA toolkit), and disappears when building with nvc++ (from nvhpc).

Here is the error message I get when building with cuda-11.8 (toolkit), OptiX 7.5.0 and g++-11 (the default on Ubuntu 22.04); the linker complains that it cannot find cudaGetDeviceProperties_v2:

[ 95%] Linking CXX executable ../../../sample00-rayGenOnly
cd /data/pkestene/install/Visu/ExaBrick/git/owl/_build/cuda-11.8-optix-7.5.0/samples/cmdline/s00-rayGenOnly && /home/pkestene/local/miniconda3/envs/eclairs-p39/bin/cmake -E cmake_link_script CMakeFiles/sample00-rayGenOnly.dir/link.txt --verbose=1
/usr/bin/c++ -O3 -DNDEBUG CMakeFiles/sample00-rayGenOnly.dir/hostCode.cpp.o CMakeFiles/sample00-rayGenOnly-ptx.dir/sample00-rayGenOnly-ptx.c.o -o ../../../sample00-rayGenOnly   -L/usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda-11.8/targets/x86_64-linux/lib  ../../../libowl.a /usr/lib/x86_64-linux-gnu/libcuda.so /usr/local/cuda-11.8/lib64/libcudart_static.a -ldl /usr/lib/x86_64-linux-gnu/librt.a -lcudadevrt -lcudart_static -lrt -lpthread -ldl 
/usr/bin/ld: ../../../libowl.a(DeviceContext.cpp.o): in function `owl::DeviceContext::getDeviceName[abi:cxx11]() const':
DeviceContext.cpp:(.text+0x52f): undefined reference to `cudaGetDeviceProperties_v2'
collect2: error: ld returned 1 exit status
make[2]: *** [samples/cmdline/s00-rayGenOnly/CMakeFiles/sample00-rayGenOnly.dir/build.make:103: sample00-rayGenOnly] Error 1
make[2]: Leaving directory '/data/pkestene/install/Visu/ExaBrick/git/owl/_build/cuda-11.8-optix-7.5.0'
make[1]: *** [CMakeFiles/Makefile2:1525: samples/cmdline/s00-rayGenOnly/CMakeFiles/sample00-rayGenOnly.dir/all] Error 2
make[1]: Leaving directory '/data/pkestene/install/Visu/ExaBrick/git/owl/_build/cuda-11.8-optix-7.5.0'

Several remarks:

  • I tried several combinations of CUDA (toolkit) and OptiX versions; they all give the same link error.
  • If I comment out the only call to cudaGetDeviceProperties(&prop, getCudaDeviceID()) in DeviceContext.cpp and just return an empty string, all the samples build, link and run OK. BTW, shouldn't we use the macro OWL_CUDA_CALL here?
  • I don't know why the link problem only happens for this single API call.

On this same machine, I can build other CUDA apps with the toolkit's nvcc without any problem.
So in the end, I have no idea what the root cause of this link problem is.
Any ideas?

@pkestene
Contributor Author

pkestene commented Apr 8, 2023

Looks like cudaGetDeviceProperties_v2 is actually missing from the CUDA 11.8 toolkit:

nm /usr/local/cuda-11.8/lib64/libcudart_static.a |grep cudaGetDeviceProperties
000000000003ec90 T cudaGetDeviceProperties

while it is present in nvhpc (which ships CUDA 12.0):

nm /data/pkestene/local/hpcsdk-23.3/Linux_x86_64/23.3/cuda/12.0/lib64/libcudart_static.a| grep cudaGetDeviceProperties
0000000000064010 T cudaGetDeviceProperties
000000000003e860 T cudaGetDeviceProperties_v2

It is also OK when moving to CUDA toolkit >= 12.
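
A side note on the check above: `grep cudaGetDeviceProperties` matches the `_v2` name as a substring too, so a stricter test can be handy. The following is a minimal sketch, not part of owl. The helper name `exports_symbol` is hypothetical, and it assumes the usual three-column `nm` output (address, type, name) shown in the listings above.

```shell
# Succeeds if the nm output read from stdin defines exactly the symbol "$1"
# in the text section (type "T"); fails otherwise.
exports_symbol() {
    awk -v sym="$1" '$2 == "T" && $3 == sym { found = 1 } END { exit !found }'
}

# Usage (the library path is an assumption; adjust to the local install):
#   nm /usr/local/cuda-11.8/lib64/libcudart_static.a \
#       | exports_symbol cudaGetDeviceProperties_v2 || echo "symbol missing"
```

Comparing whole fields rather than substrings means a library that only provides `cudaGetDeviceProperties` is not mistaken for one that also provides `cudaGetDeviceProperties_v2`.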

@pkestene pkestene closed this as completed Apr 8, 2023
@ingowald
Contributor

ingowald commented Apr 8, 2023

So just double-checking on this: it looks like you built with one CUDA 11.8 install and then tried to run with another 11.8 install, and that was causing the issue? I.e., both by themselves would have worked, just mixing them didn't?

(Agreed that having two different distributions with the same version number is "funny", though :-) Just trying to make sure that it's not related to owl.)

@pkestene
Contributor Author

pkestene commented Apr 8, 2023

Just to be clear:

  • building with cuda 11.8 (from toolkit) fails, the error is at link
  • building with cuda (from nvhpc, here 12.0) is fine, and I run using the same env
  • building with cuda 12.0 (from toolkit) is fine, runs fine

The problem was that libcudart_static.a shipped with cuda 11.8 doesn't provide cudaGetDeviceProperties_v2.

At least, it's working now. My next step is playing with owlExaBrick.
Thanks for making this available.

@ingowald
Contributor

ingowald commented Apr 8, 2023

Huh; that is "slightly" concerning. Basically what you're saying is that CUDA 11.8 is broken :-/. Huh. Now we have three options: a) try and fix the code even for cuda 11.8; b) go into cmakefile, detect cuda version, and at least throw an error; or c) ignore, and hope that people will use the newer cuda 12, anyway...
Anyway - thanks for reporting this - OS/toolchain related stuff is always nasty, in particular when version dependent...

@pkestene
Contributor Author

pkestene commented Apr 9, 2023

I tried another machine running RedHat 8 with CUDA 11.8 (both toolkit and nvhpc): no problem there.

When I look for symbols cudaGetDeviceProperties/cudaGetDeviceProperties_v2 on that machine, I get the same results as on Ubuntu, that is:

> nm /ccc/products/cuda-11.8/system/toolkit/lib64/libcudart_static.a | grep cudaGetDeviceProper
000000000003ec90 T cudaGetDeviceProperties
> nm /ccc/products/cuda-12.0/system/toolkit/lib64/libcudart_static.a | grep cudaGetDeviceProper
0000000000064010 T cudaGetDeviceProperties
000000000003e860 T cudaGetDeviceProperties_v2

cuda-11.8 only contains one of them; cuda-12.0 contains both.

But on RedHat, the owl sample apps link fine with cuda-11.8, even though cudaGetDeviceProperties_v2 is not present.

@pkestene
Contributor Author

pkestene commented Apr 9, 2023

I finally found the problem on my Ubuntu machine. Even though both CUDA toolkits 11.8 and 12.0 were installed in completely separate directories, installing the newer toolkit makes the Ubuntu package create, by default, a symlink /usr/local/cuda -> /etc/alternatives/cuda -> /etc/alternatives/cuda-12.0
so /usr/local/cuda always points to the latest CUDA toolkit installed.

So what happened is that I was compiling with nvcc 11.8, but the CUDA headers were actually taken from 12.0. There was therefore a mismatch between the header version and the runtime library version: the 12.0 headers apparently make the call to cudaGetDeviceProperties resolve to cudaGetDeviceProperties_v2, which the 11.8 libcudart_static.a does not provide.
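
This kind of header/compiler mismatch can be checked from the command line. The following is a rough sketch, not something owl provides. The helper names `header_cuda_version` and `nvcc_release` are hypothetical, and it assumes that `cuda_runtime_api.h` defines `CUDART_VERSION` as major*1000 + minor*10 (e.g. 11080 for 11.8) and that `nvcc --version` prints a line containing `release X.Y`.

```shell
# Print the CUDA version (e.g. "12.0") encoded in CUDART_VERSION in the
# cuda_runtime_api.h found under the include directory "$1".
header_cuda_version() {
    awk '$1 == "#define" && $2 == "CUDART_VERSION" {
             printf "%d.%d\n", int($3 / 1000), int(($3 % 1000) / 10); exit
         }' "$1/cuda_runtime_api.h"
}

# Extract the release number (e.g. "11.8") from `nvcc --version` output on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage (paths are assumptions; adjust to the local install):
#   readlink -f /usr/local/cuda      # where does the symlink really point?
#   [ "$(nvcc --version | nvcc_release)" = \
#     "$(header_cuda_version /usr/local/cuda/include)" ] \
#       || echo "header/compiler version mismatch"
```

On a machine set up like the one described here, the two versions would be expected to differ (11.8 for nvcc, 12.0 for the headers under /usr/local/cuda/include).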

So the question is: why is /usr/local/cuda/include included in CUDA_INCLUDES?

Finally, I think the problem is here:
https://github.com/owl-project/owl/blob/master/owl/CMakeLists.txt#L165

The path /usr/local/cuda/include is unconditionally included.

But this path is really not needed when using an alias library like CUDA::cudart_static.

A possible fix is to replace:

target_include_directories(owl
  PUBLIC
    /usr/local/cuda/include/
    ${PROJECT_SOURCE_DIR}
    ${CMAKE_CURRENT_LIST_DIR}/include
)

by

target_include_directories(owl
  PUBLIC
    ${PROJECT_SOURCE_DIR}
    ${CMAKE_CURRENT_LIST_DIR}/include
)

So that the path /usr/local/cuda/include/ is not added by default.
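
To look for other hard-coded references of the same kind, a quick scan of the CMake files may help. This is only a sketch: the helper name `find_hardcoded_cuda_paths` is hypothetical, and it simply searches for the literal string `/usr/local/cuda` in `CMakeLists.txt` and `*.cmake` files.

```shell
# List "file:line:text" for every hard-coded /usr/local/cuda reference in the
# CMake files below directory "$1". Passing /dev/null as an extra file makes
# grep print the file name even when find hands it a single file.
find_hardcoded_cuda_paths() {
    find "$1" -type f \( -name CMakeLists.txt -o -name '*.cmake' \) \
        -exec grep -n '/usr/local/cuda' /dev/null {} +
}

# Usage, from the root of a source checkout:
#   find_hardcoded_cuda_paths .
```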

I think that definitely closes the issue. I can provide a small patch if needed.

pkestene added a commit to pkestene/owl that referenced this issue Apr 10, 2023
On a system where several nvcc toolkits are installed, this path is often an
alias to the latest installed toolkit; hence when trying to build with an older
version of nvcc you end up in a situation where the old nvcc compiler is using
new headers; this situation may lead to errors at link time (undefined symbols).
See issue owl-project#191 for discussion.