libucc_tl_cuda.so: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType #496

Open
zasdfgbnm opened this issue May 3, 2022 · 7 comments

Comments

@zasdfgbnm
Contributor

I am seeing this error:

libucc_tl_cuda.so: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType

Thanks to @crcrpar, who figured out that this is a new API: https://github.com/NVIDIA/nvidia-settings/blame/5b455b89bb73f56818c84444806bc9c928da67ac/src/nvml.h#L6009-L6026

For older versions of drivers, is it possible to use other APIs to achieve similar functionality? Or at least detect the version and throw a kinder error message?

cc: @ptrblck

@jladd-mlnx

@bureddy, can you take a look, please?

@vspetrov
Collaborator

vspetrov commented May 4, 2022

Hi @zasdfgbnm, the existing autotools code actually does check for the presence of that function at compile time, here:

[AC_CHECK_DECL([nvmlDeviceGetNvLinkRemoteDeviceType],

So I guess the function was available at compile time but, in your case, is not available at runtime. This implies a mismatch between the compile-time and runtime CUDA versions. Could you please check the environment and confirm?
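To illustrate the compile-time versus runtime distinction described above, here is a minimal sketch in C of a runtime probe that resolves the symbol with `dlopen`/`dlsym` instead of relying on the link-time reference. The helper names `probe_symbol` and `check_nvml_nvlink_support` are hypothetical and are not part of UCC; the library name `libnvidia-ml.so.1` is the usual NVML soname, and the wording of the messages is only a suggestion.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Result codes for probe_symbol(). */
enum probe_result {
    PROBE_NO_LIBRARY = -1, /* the library itself could not be loaded   */
    PROBE_NO_SYMBOL  = 0,  /* library loaded, but the symbol is absent */
    PROBE_FOUND      = 1   /* symbol resolved at runtime               */
};

/*
 * Hypothetical helper: check at runtime whether `symbol` can be resolved
 * in `library`. Passing NULL for `library` searches the already-loaded
 * program image and its dependencies.
 */
enum probe_result probe_symbol(const char *library, const char *symbol)
{
    void *handle;
    void *addr;

    assert(symbol != NULL);

    handle = dlopen(library, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return PROBE_NO_LIBRARY;
    }
    addr = dlsym(handle, symbol);
    dlclose(handle);
    return (addr != NULL) ? PROBE_FOUND : PROBE_NO_SYMBOL;
}

/*
 * Hypothetical helper: returns 0 if the NVLink remote-device-type query
 * can be resolved in the NVML library present at runtime, -1 otherwise,
 * printing a more descriptive message than the loader's
 * "undefined symbol" error.
 */
int check_nvml_nvlink_support(void)
{
    const char *lib = "libnvidia-ml.so.1";
    const char *sym = "nvmlDeviceGetNvLinkRemoteDeviceType";

    switch (probe_symbol(lib, sym)) {
    case PROBE_FOUND:
        return 0;
    case PROBE_NO_SYMBOL:
        fprintf(stderr,
                "%s does not export %s; the NVML library found at runtime "
                "may be older than the one used at build time\n", lib, sym);
        return -1;
    default:
        fprintf(stderr, "%s could not be loaded\n", lib);
        return -1;
    }
}
```

A caller could run `check_nvml_nvlink_support()` once during initialization and skip the NVLink topology query when it returns -1. On older glibc versions the program may need to be linked with `-ldl`.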

@crcrpar
Contributor

crcrpar commented May 4, 2022

We're seeing the undefined symbol message when we run a container that has CUDA 11.6 on a host with an older driver.

@bureddy
Collaborator

bureddy commented May 4, 2022

What is the driver version? Is it possible to choose the right CUDA toolkit version in the container?
https://docs.nvidia.com/deploy/cuda-compatibility/index.html
Otherwise, I think you need to have cuda-compat-11.6 in the container for compatibility.

@ptrblck

ptrblck commented May 4, 2022

The KMD (kernel-mode driver) was 460.73.01, the UMD (user-mode driver) was 510.47.03, and forward compat was used.
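As a rough illustration of the "detect the version" idea raised earlier in the thread, the following C sketch compares two dotted driver version strings such as the ones quoted above. The function names are hypothetical, and the sketch deliberately leaves out how the version string is obtained (for example via NVML's `nvmlSystemGetDriverVersion`) and which minimum version a given NVML function requires; both are left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the leading numeric component of `s` into `*value` and return a
 * pointer to the next component. Parsing stops at the end of the string
 * or at the first unexpected character.
 */
static const char *next_component(const char *s, unsigned long *value)
{
    char *end;

    *value = strtoul(s, &end, 10);
    if (*end == '.') {
        return end + 1;
    }
    return end + strlen(end);
}

/*
 * Hypothetical helper: compare two dotted numeric version strings,
 * e.g. "460.73.01" and "510.47.03". Returns a negative value, zero, or
 * a positive value if `a` is older than, equal to, or newer than `b`.
 * Missing trailing components are treated as zero.
 */
int compare_driver_versions(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a != '\0' || *b != '\0') {
        unsigned long part_a;
        unsigned long part_b;

        a = next_component(a, &part_a);
        b = next_component(b, &part_b);
        if (part_a != part_b) {
            return (part_a < part_b) ? -1 : 1;
        }
    }
    return 0;
}
```

With the versions from this comment, `compare_driver_versions("460.73.01", "510.47.03")` is negative, which a library could use to report that the kernel-mode driver is older than the user-mode components instead of failing with an undefined symbol.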

@bureddy
Collaborator

bureddy commented May 4, 2022

Unfortunately, it seems there is no forward compat for NVML (libnvidia-ml.so).

@crcrpar
Contributor

crcrpar commented May 5, 2022

@bureddy, what do you think about @zasdfgbnm's second question?

> Or at least detect the version and throw a kinder error message?
