[BUG] Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined #2901

wqw547243068 · 2023-02-24T09:20:12Z

🐛 Describe the bug

Installation

Installation steps:

# create and activate virtualenv
# install 
cd application/ChatGPT
pip install .
# test
cd examples
sh train_dummy.sh

First Error with virtualenv

Then this error message pops up:

OSError: /root/bin/ai/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

It seems there is something wrong with CUDA.

Second Error with CUDA

When I leave the virtualenv with the deactivate command and return to the default Python environment, that error disappears and another one comes up:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 146 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 147) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dummy.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-24_17:26:03
  host      : mlxlab6zo2knwh6360c359-20221101065730-6hyrm1-ejpeww-worker
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 147)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
INFO[0076] Worker 0 Status Failed                        host=10.22.148.79 message= reason=Error
error: exec command: 0

Why does this happen?

CUDA error: invalid device ordinal

Environment

Environment:

python 3.7.3
pytorch: 1.13.1
cuda: 11.3
cpu: 1

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   39C    P0    45W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
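
One possible reading of the second error, not confirmed anywhere in this thread: the failing worker is rank 1 (local_rank: 1), while the nvidia-smi output above lists only GPU 0, so the launcher may have started more workers than there are GPUs. The sketch below checks that condition before training starts. The helper names `check_local_rank` and `visible_gpu_count` are hypothetical; `LOCAL_RANK` is the environment variable torchrun sets for each worker, and `torch.cuda.device_count()` is the PyTorch call that reports visible GPUs.

```python
import os


def check_local_rank(local_rank: int, gpu_count: int) -> bool:
    """Return True when local_rank is a valid GPU index for gpu_count devices."""
    return 0 <= local_rank < gpu_count


def visible_gpu_count() -> int:
    """Number of GPUs PyTorch can see; 0 if torch cannot be imported."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    # torchrun exports LOCAL_RANK for every worker process it starts.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    gpus = visible_gpu_count()
    if not check_local_rank(local_rank, gpus):
        raise SystemExit(
            f"LOCAL_RANK={local_rank} but only {gpus} GPU(s) are visible"
        )
```

If this check fails on a single-GPU machine, lowering the worker count passed to torchrun (its `--nproc_per_node` option) to 1 would keep every rank on a valid device. Whether train_dummy.sh requests more than one worker is an assumption here, since the script is not shown in this report.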


JThh · 2023-02-27T04:07:08Z

This might be caused by a mismatch between the torch and CUDA versions. Could you try reinstalling torch via conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch?
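
Before reinstalling, it can help to see which torch and NVIDIA packages pip has actually placed in the active environment, since the failing library in the first error lives under site-packages/nvidia/cublas. The following is a minimal sketch, assuming Python 3.8 or newer for `importlib.metadata` (the reporter's Python 3.7 would need the `importlib_metadata` backport instead). The function names `is_cuda_related` and `cuda_related_packages` are made up for this example.

```python
from importlib import metadata


def is_cuda_related(name: str) -> bool:
    """True for the torch family and for pip-distributed nvidia-* wheels."""
    lowered = name.lower()
    return lowered in {"torch", "torchvision", "torchaudio"} or lowered.startswith("nvidia-")


def cuda_related_packages() -> dict:
    """Map package name to version for installed torch / nvidia-* distributions."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Broken installs can yield a distribution without a name; skip those.
        if name and is_cuda_related(name):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for name, version in sorted(cuda_related_packages().items()):
        print(f"{name}=={version}")
```

Running it once inside the virtualenv and once in the default environment makes it easy to compare the two sets of versions side by side.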

yctam · 2023-04-19T10:01:21Z

I removed the miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error goes away.
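
To see which copies of the cuBLAS libraries are present before deleting anything, a search along these lines may help. It is only a sketch: `find_cublas_copies` and `default_roots` are hypothetical helper names, and the search is limited to the site-packages and dist-packages directories on the current interpreter's `sys.path`, so it has to be run once per environment.

```python
import sys
from pathlib import Path


def find_cublas_copies(roots):
    """Return sorted paths of every libcublas* shared object under the given roots."""
    hits = set()
    for root in roots:
        root = Path(root)
        if root.is_dir():
            # Matches libcublas.so.11, libcublasLt.so.11, and similar names.
            hits.update(p for p in root.rglob("libcublas*.so*") if p.is_file())
    return sorted(hits)


def default_roots():
    """Package directories of the interpreter that runs this script."""
    return [p for p in sys.path if p.endswith(("site-packages", "dist-packages"))]


if __name__ == "__main__":
    for path in find_cublas_copies(default_roots()):
        print(path)
```

More than one directory in the output (for example both nvidia/cublas/lib and torch/lib) would be consistent with the duplicate-library situation described in this comment.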

yustiks · 2023-04-27T15:34:39Z

> I removed the miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error goes away.

Thank you! Worked for me

ADongGu · 2023-10-19T09:12:07Z

Your directory contains extra directories that were not there originally, so delete the folders that do not belong to the project.

wqw547243068 added the bug Something isn't working label Feb 24, 2023

wqw547243068 changed the title ~~[BUG] Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference~~ [BUG] Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined Feb 24, 2023

wqw547243068 closed this as completed Mar 1, 2023

ajithkumarmcw mentioned this issue May 6, 2023

can you give a train.py file to train the detection model on custom dataset its complicated without it Deci-AI/super-gradients#929

Closed

Louis-Dupont mentioned this issue May 16, 2023

How to setup in Kaggle notebook? Deci-AI/super-gradients#1022

Closed

jychoi0616 mentioned this issue Aug 18, 2023

libcublas.so.11 error MPI-Dortmund/tomotwin-cryoet#42

Closed

This was referenced Oct 6, 2023

Failed to install on WSL2 - setuptools.sandbox.UnpickleableException MVIG-SJTU/AlphaPose#1147

Closed

Why is the inference code for video so slow? MVIG-SJTU/AlphaPose#1182

Open
