[BUG] Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined #2901

Closed
wqw547243068 opened this issue Feb 24, 2023 · 4 comments
Labels
bug Something isn't working

Comments

wqw547243068 commented Feb 24, 2023

🐛 Describe the bug

Installation

Installation steps:

# create and activate virtualenv
# install 
cd application/ChatGPT
pip install .
# test
cd examples
sh train_dummy.sh

First Error with virtualenv

Then this error message pops up:

OSError: /root/bin/ai/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

It seems something is wrong with CUDA ...

Second Error with CUDA

When I leave the virtualenv with the deactivate command and return to the default Python environment, that error disappears and another one comes up:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 146 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 147) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dummy.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-24_17:26:03
  host      : mlxlab6zo2knwh6360c359-20221101065730-6hyrm1-ejpeww-worker
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 147)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
INFO[0076] Worker 0 Status Failed                        host=10.22.148.79 message= reason=Error
error: exec command: 0

Why?

CUDA error: invalid device ordinal

Environment

Environment:

  • python 3.7.3
  • pytorch: 1.13.1
  • cuda: 11.3
  • cpu: 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   39C    P0    45W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
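A note on the second error: the nvidia-smi output above lists a single GPU (index 0), while the failing worker reports local_rank: 1. If the training script maps each local rank to the CUDA device with the same index (an assumption, since train_dummy.py is not shown in this thread), rank 1 would request a device that does not exist, which is one way to get "invalid device ordinal". The sketch below is a hypothetical pre-flight check, not part of the project; the function name `check_local_rank` is made up for illustration.

```python
def check_local_rank(local_rank: int, visible_gpus: int) -> None:
    """Raise early if a worker's local rank has no matching CUDA device.

    Hypothetical helper: assumes the script uses local_rank as the device index.
    """
    if visible_gpus <= 0:
        raise RuntimeError("no CUDA devices are visible to this process")
    if not 0 <= local_rank < visible_gpus:
        raise RuntimeError(
            f"local_rank {local_rank} has no matching GPU: "
            f"only {visible_gpus} device(s) visible (valid indices 0..{visible_gpus - 1})"
        )


# Possible use inside a script launched by torchrun (requires torch, so it is
# left as a comment here):
#
#   import os
#   import torch
#   check_local_rank(int(os.environ["LOCAL_RANK"]), torch.cuda.device_count())
```

If this check fails on a one-GPU machine, lowering the number of processes that torchrun starts per node (its `--nproc_per_node` option) to 1 is the usual thing to try; whether train_dummy.sh sets that option is not confirmed here.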
JThh (Contributor) commented Feb 27, 2023

This may be caused by a mismatch between the torch and CUDA versions. Could you try reinstalling torch with the following command?

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
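Before reinstalling, it can help to see which torch and NVIDIA packages the active environment actually contains. The following is a minimal sketch that uses only the standard library (`importlib.metadata`, Python 3.8+); the helper names `select_gpu_packages` and `installed_gpu_packages` are hypothetical, and the name prefixes are an assumption about how the relevant packages are named.

```python
from importlib import metadata

# Assumed prefixes: torch, torchvision, torchaudio, and pip-installed
# NVIDIA runtime packages whose names start with "nvidia-".
PREFIXES = ("torch", "nvidia-")


def select_gpu_packages(pairs):
    """Keep (name, version) pairs whose name looks torch- or NVIDIA-related."""
    selected = {}
    for name, version in pairs:
        if name and name.lower().startswith(PREFIXES):
            selected[name.lower()] = version
    return selected


def installed_gpu_packages():
    """Collect matching packages from the currently active environment."""
    pairs = ((d.metadata["Name"], d.version) for d in metadata.distributions())
    return select_gpu_packages(pairs)


if __name__ == "__main__":
    for name, version in sorted(installed_gpu_packages().items()):
        print(f"{name}=={version}")
```

Running it once inside the virtualenv and once in the default environment makes it easier to compare the two setups described in the report.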

yctam commented Apr 19, 2023

I removed miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error went away.
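To see whether an environment has the same kind of overlapping cuBLAS copies before deleting anything, a read-only listing may be useful. This is a sketch under assumptions: it only looks in the two locations named in the comment above (nvidia/cublas/lib and torch/lib under a site-packages directory), and the function name `find_cublas_copies` is hypothetical.

```python
import sysconfig
from pathlib import Path


def find_cublas_copies(site_packages):
    """List libcublas* files in the two locations mentioned in this thread."""
    root = Path(site_packages)
    patterns = ("nvidia/cublas/lib/libcublas*", "torch/lib/libcublas*")
    found = []
    for pattern in patterns:
        # Path.glob returns nothing if the directory does not exist.
        found.extend(sorted(root.glob(pattern)))
    return found


if __name__ == "__main__":
    # purelib is the site-packages directory of the running interpreter.
    for path in find_cublas_copies(sysconfig.get_paths()["purelib"]):
        print(path)
```

The script only prints paths. Whether removing any of them is safe depends on the installation, so keeping a backup of the files first is prudent.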

yustiks commented Apr 27, 2023

> I removed the miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error goes away.

Thank you! Worked for me

ADongGu commented Oct 19, 2023

Your directory contains extra folders that were not there originally, so delete the folders that do not belong to the project.
