
Inconsistency detected by ld.so: dl-version.c: 224: _dl_check_map_versions: Assertion `needed != NULL' failed! #9754

Closed
Feywell opened this issue Nov 13, 2021 · 19 comments · Fixed by #17365
Labels
api (issues related to all other APIs: C, C++, Python, etc.)

Comments

@Feywell commented Nov 13, 2021

Describe the bug
I use onnxruntime-gpu to run inference on my own ONNX model. It works well when the input data is on the CPU device,
but an error is thrown when the input data is on the GPU device.

This works:
ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(img.numpy())
This fails:
ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(img_lq.numpy(), device_type="cuda", device_id=0)
The error is:

Inconsistency detected by ld.so: dl-version.c: 224: _dl_check_map_versions: Assertion `needed != NULL' failed!
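
For reference, a self-contained version of the snippet above (the array shape is arbitrary; img/img_lq in the original come from the reporter's own pipeline):

import numpy as np
import onnxruntime

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# CPU OrtValue -- works:
cpu_value = onnxruntime.OrtValue.ortvalue_from_numpy(x)

# CUDA OrtValue -- triggers the ld.so assertion when the CUDA provider's
# shared-library dependencies cannot be resolved:
gpu_value = onnxruntime.OrtValue.ortvalue_from_numpy(x, device_type="cuda", device_id=0)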


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • ONNX Runtime installed from (source or binary): pypi
  • ONNX Runtime version: onnxruntime-gpu 1.9.0
  • Python version: Python 3.6.13
  • Visual Studio version (if applicable):
  • GCC/Compiler version (if compiling from source): GCC 7.3.0
  • CUDA/cuDNN version: cudatoolkit 10.1.243
  • GPU model and memory: gtx 2080ti


Expected behavior
Any help with using the GPU version to run inference on an ONNX model?
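
A minimal sketch of GPU-side inference with the CUDA execution provider and IO binding (the model path and tensor names below are placeholders, not taken from this issue):

import numpy as np
import onnxruntime

# "model.onnx", "input" and "output" are placeholder names.
sess = onnxruntime.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

x = onnxruntime.OrtValue.ortvalue_from_numpy(
    np.random.rand(1, 3, 224, 224).astype(np.float32), "cuda", 0)

io = sess.io_binding()
io.bind_ortvalue_input("input", x)   # input stays in GPU memory
io.bind_output("output", "cuda")     # let ORT allocate the output on the GPU
sess.run_with_iobinding(io)
result = io.copy_outputs_to_cpu()[0]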

Screenshots: [attached image showing the assertion failure]

@yuslepukhin (Member)

I have searched the internet and this appears to be relevant.

@yuslepukhin (Member)

Other results suggest that it may be due to a missing library. To track that down you would have to use ldd to inspect the library dependencies and find out what is missing on the system. Unfortunately, it may be some obscure library.
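
For example, something along these lines (assuming the usual onnxruntime-gpu wheel layout under site-packages; note that, as discussed later in this thread, ldd itself may complain about the patched onnxruntime libraries):

# Find where the wheel is installed, then list unresolved shared-library dependencies.
python -c "import onnxruntime, os; print(os.path.dirname(onnxruntime.__file__))"
ldd <site-packages>/onnxruntime/capi/libonnxruntime_providers_cuda.so | grep "not found"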

@snnn (Member) commented Nov 16, 2021

It was caused by the patchelf tool we use.
An alternative solution: Patch https://github.com/pypa/auditwheel, add a custom policy file to whitelist CUDA libraries. Then we can remove our hacks in setup.py.

The warnings generated by ldd don't affect normal use; the only impact is that you can't run ldd against the library.

@adk9 (Contributor) commented Nov 24, 2021

I'm also running into the same issue with onnxruntime-training (1.9.0) and onnxruntime-gpu (1.9.0) wheels installed from PyPI, trying to train a simple model using the CUDA EP. @snnn, the suggested fix above is not clear to me; can you elaborate on it?

@snnn (Member) commented Nov 24, 2021

First, the onnxruntime Python packages, "onnxruntime" and "onnxruntime-gpu", follow the manylinux2014 (PEP 599) standard. But the GPU one, onnxruntime-gpu, isn't fully compliant.

The PEP 599 policy says: "The wheel's binary executables or shared objects may not link against externally-provided libraries except those in the following list"

  • libgcc_s.so.1
  • libstdc++.so.6
  • libm.so.6
  • libdl.so.2
  • librt.so.1
  • libc.so.6
  • libnsl.so.1
  • libutil.so.1
  • libpthread.so.0
  • libresolv.so.2
  • libX11.so.6
  • libXext.so.6
  • libXrender.so.1
  • libICE.so.6
  • libSM.so.6
  • libGL.so.1
  • libgobject-2.0.so.0
  • libgthread-2.0.so.0
  • libglib-2.0.so.0

But we need CUDA, and CUDA isn't in the list. By the way, if you run ldd against onnxruntime's CPU-only package, "onnxruntime", you won't see this error.

The policy was designed so that every external dependency gets packed into the wheel file. However, we can't do that, because:

  1. It would make the package too large. PyPI has a 100 MB size limit per wheel file.
  2. It could cause license issues. I'm not sure we could redistribute Nvidia's CUDA libraries.

So we did a dirty hack. Before packing the wheel, we patch the .so file to pretend it doesn't depend on CUDA, in order to get past manylinux's auditwheel tool. Then we pack the wheel and load the CUDA libraries manually at runtime. The error message you saw is caused by the tool we use to patch the *.so files: patchelf. If we didn't use that tool, we wouldn't have this issue.

Alternatively, we could modify the policy: patch the auditwheel tool and add a custom policy file to whitelist the CUDA libraries. The file is https://github.com/pypa/auditwheel/blob/main/src/auditwheel/policy/manylinux-policy.json . See #144 for more information.

(Hi @adk9, the above answer only applies to the onnxruntime inference packages. The onnxruntime-training package is built in a special way that I'm not familiar with.)
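
A minimal diagnostic sketch of the "load the CUDA libraries manually" part described above, assuming the SONAMEs used by CUDA 11.x / cuDNN 8 builds (adjust the names for your CUDA and cuDNN versions):

import ctypes

# Example SONAMEs only; substitute the ones your onnxruntime-gpu build expects.
for soname in ("libcudart.so.11.0", "libcublas.so.11", "libcurand.so.10",
               "libcufft.so.10", "libcudnn.so.8"):
    try:
        ctypes.CDLL(soname)
        print("loaded", soname)
    except OSError as err:
        print("FAILED to load", soname, "->", err)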

@GuillaumeTong

Hi @snnn thank you for the detailed explanation of the problem. If I understand correctly, this is basically a problem that stems from not being able to specify CUDA as a pip dependency?

I am afraid that as a small potato, not familiar with the deeper workings of pip and auditwheel, I am unable to implement the fix you are describing.

Could the problem be fixed by manually installing a compatible version of CUDA?

Otherwise, could you give a more line-by-line set of instructions on which files to edit and what to do after the edit (I take it I need to build the wheel after editing the rules? I have never done that before)?

@snnn (Member) commented Feb 14, 2022

Could the problem be fixed by manually installing a compatible version of CUDA?

Yes. Then ldd will still not work, but the onnxruntime python package should be good.

@GuillaumeTong commented Feb 15, 2022

In my specific case, after looking at the ONNX Runtime requirements again, I noticed that I might be missing cuDNN.
I tried installing libcudnn8 and libcudnn8-dev, and I was able to run my code successfully. I'm not quite sure whether libcudnn8-dev was necessary or if libcudnn8 alone would have been sufficient.
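
For reference, on Ubuntu with NVIDIA's apt repository configured, that install is roughly (package names as in the comment above; whether the -dev package is strictly needed is unclear):

sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev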

@VikasOjha666 commented Apr 13, 2022

@GuillaumeTong @snnn I am facing the same issue. My CUDA version is 10.2; I followed the same cuDNN installation instructions for CUDA 10.2 and am still getting the same error. My OS is Ubuntu 18.04 on AWS with a T4 GPU. What could be the reason? I also have CUDA 11.6 on the system. In my case the problem shows up with the onnxruntime-training installation. Below is the code that triggers the error:

from onnxruntime import OrtValue
import numpy as np
x = OrtValue.ortvalue_from_numpy(np.random.rand(3), 'cuda')

It works for:
x = OrtValue.ortvalue_from_numpy(np.random.rand(3), 'cpu')

@snnn (Member) commented Apr 14, 2022

I'm not familiar with the onnxruntime-training package. @askhade, could you please help? I think there might be some code in PyTorch that runs this "ldd" command. Do you know how to reproduce it?

@farzanehnakhaee70

Upgrading CUDA to a version compatible with onnxruntime-gpu also solved my issue. Thanks a lot.

@camblomquist

I'm in a bit of a similar pickle here, though it might be one outside the scope of this issue or project. The environment includes Ubuntu 18.04, CUDA 11.4, cuDNN 8.2.4, Python 3.6, and onnxruntime-gpu from pip. I'm packaging the project as a onedir executable using PyInstaller.
On the build machine, it appears to work as expected. When taking the packaged executable onto a different machine, I get hit with the "Inconsistency detected" assertion failure. This other machine is also running CUDA 11.4 but uses a different GPU. For reasons, I'm trying to avoid modifying the system itself on this machine, so it needs to work using the libraries included by PyInstaller.
PyInstaller appears to be properly including all of the relevant CUDA libraries, and I've manually specified the inclusion of the onnxruntime_provider_cuda.so and *_shared.so files. For giggles, removing the CUDA libraries from the directory does correctly cause the program to error out due to missing files rather than assertions.
When running with LD_DEBUG=all, the last line before the assertion failure is Checking for version 'libcufft.so.10'..., and the rest of it just says that the library is required by the onnxruntime provider library. I don't know if this information is of any use.

@camblomquist

My apologies for the ramble. Desperation tends to do that. I had resolved the issue on my own. It turns out that PyInstaller was not including all of the necessary CUDA libraries. Including them manually allowed onnxruntime to start up (and then crash when it couldn't find the cudnn_*_infer libraries, but that error was transparent.)
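
For anyone in the same spot: one way to include such libraries explicitly is the binaries list of the PyInstaller .spec file. A hedged fragment (library paths and SONAMEs below are illustrative, not taken from this comment):

# Fragment of a PyInstaller .spec file; adjust paths to your CUDA/cuDNN install.
a = Analysis(
    ["app.py"],
    binaries=[
        ("/usr/local/cuda/lib64/libcufft.so.10", "."),
        ("/usr/local/cuda/lib64/libcublas.so.11", "."),
        ("/usr/lib/x86_64-linux-gnu/libcudnn.so.8", "."),
    ],
)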

I will say though that it is incredibly frustrating to have spent the time on what ended up being a fairly simple issue. I understand that the import hack is done to avoid the eyes of the auditor, but this same hack made it much more difficult to realize that it was just a matter of a missing dependency. I know I'm barking up the wrong tree here since I could've just not used Python, but this type of error would've happened much sooner in the pipeline and likely had a more useful error message in any compiled language.

@biendltb commented Aug 4, 2022

If you are running with the TensorrtExecutionProvider, reinstalling the libnvinfer libraries solved the issue in my case:

  • Uninstall libnvinfer libraries:
sudo apt-get purge "libnvinfer*"
  • Install all the libnvinfer libraries for your CUDA version. Run apt-cache policy libnvinfer8 to check the available versions.

For cuda-11.4 and libnvinfer 8.2.5.1:

sudo apt install libnvinfer8=8.2.5-1+cuda11.4 libnvinfer-plugin8=8.2.5-1+cuda11.4 libnvparsers8=8.2.5-1+cuda11.4 libnvonnxparsers8=8.2.5-1+cuda11.4 libnvinfer-dev=8.2.5-1+cuda11.4 libnvinfer-plugin-dev=8.2.5-1+cuda11.4 libnvparsers-dev=8.2.5-1+cuda11.4 libnvonnxparsers-dev=8.2.5-1+cuda11.4 cuda-cudart-dev-11-4 libcublas-dev-11-4

For cuda-11.6 and libnvinfer 8.4.3 (also tested with cuda-11.8):

sudo apt install libnvinfer8=8.4.3-1+cuda11.6 libnvinfer-plugin8=8.4.3-1+cuda11.6 libnvparsers8=8.4.3-1+cuda11.6 libnvonnxparsers8=8.4.3-1+cuda11.6 libnvinfer-dev=8.4.3-1+cuda11.6 libnvinfer-plugin-dev=8.4.3-1+cuda11.6 libnvinfer-bin=8.4.3-1+cuda11.6 libnvparsers-dev=8.4.3-1+cuda11.6 libnvonnxparsers-dev=8.4.3-1+cuda11.6 cuda-cudart-dev-11-6 libcublas-dev-11-6

Hope it helps.

@sophies927 added the "api" label and removed the "api:Python" label on Aug 12, 2022
@gorkemgoknar

I hit a similar issue.
Apparently you need to use the onnxruntime-gpu version that matches your system's CUDA installation.
I have CUDA 10.2 installed; after installing onnxruntime-gpu==1.6 I no longer saw this error.
Onnxruntime requirements:
https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements

@Apisteftos

[quotes @snnn's earlier explanation of the manylinux2014 / PEP 599 policy and the patchelf hack in full]

Can you provide a bit more help on how to patch the auditwheel tool and create a custom policy file to whitelist the CUDA libraries? What are the steps? I have the same issue: I downgraded my CUDA version from 11.8 to 11.6, but the problem remains. I am trying to run inference on some images, but it works only on the CPU and unfortunately not on CUDA.

@yongjer commented Jun 17, 2023

I have the same issue. I'm using a Docker environment:

FROM nvcr.io/nvidia/pytorch:23.02-py3
RUN apt-get update && pip install \
    transformers \
    datasets \
    accelerate \
    optimum[onnxruntime-gpu] \
    diffusers \
    evaluate \
    jupyter \
    notebook \
    && rm -rf /var/lib/apt/lists/

@mattip commented Jul 27, 2023

Stumbled across this issue. FWIW, newer auditwheel has an option to exclude shared objects that will be provided in a different manner. This is the PR that added the --exclude option, pypa/auditwheel#368, specifically for the use case described here:

$ auditwheel repair --help
usage: auditwheel repair [-h] [--plat PLATFORM] [-L LIB_SDIR] [-w WHEEL_DIR] [--no-update-tags] [--strip] [--exclude EXCLUDE]
                         [--only-plat]
                         WHEEL_FILE [WHEEL_FILE ...]

Vendor in external shared library dependencies of a wheel.
If multiple wheels are specified, an error processing one
wheel will abort processing of subsequent wheels.
...
 --exclude EXCLUDE     Exclude SONAME from grafting into the resulting wheel (can be specified multiple times)
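
A hypothetical invocation for the case discussed in this thread (the wheel path and excluded SONAMEs are illustrative):

auditwheel repair dist/onnxruntime_gpu-*.whl \
    --plat manylinux2014_x86_64 \
    --exclude libcublas.so.11 --exclude libcudnn.so.8 \
    -w wheelhouse/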

@snnn (Member) commented Aug 15, 2023

I will need to rework PR #1282.

snnn added a commit that referenced this issue Sep 8, 2023
### Description
Resolve #9754
snnn added a commit that referenced this issue Sep 15, 2023
### Description
Resolve #9754
snnn added a commit that referenced this issue Sep 18, 2023
### Description
1. Delete Prefast tasks (#17522)
2. Disable yum update (#17551)
3. Avoid calling patchelf (#17365 and #17562) so that we can validate
the above fix

The main problem I'm trying to solve is: our GPU package depends on both
CUDA 11.x and CUDA 12.x. However, it's not easy to see this information
because ldd doesn't work with the shared libraries we generate (see issue
#9754). So the patchelf changes are useful for validating that the
"Disable yum update" change was successful. As you can see, we call "yum
update" from multiple places; without some kind of validation it's hard
to say whether I have covered all of them.
The Prefast change is needed because I'm going to update the VM images
in the next few weeks, in case we need to publish a patch release
after that.

### Motivation and Context
Without this fix we will mix CUDA 11.x and CUDA 12.x libraries, and it will
crash every time we use TensorRT.
kleiti pushed a commit to kleiti/onnxruntime that referenced this issue Mar 22, 2024