import error, undefined symbol: ncclBroadcast #18

Open
yyq opened this issue May 30, 2023 · 9 comments
Labels: Environment (issues related to system environment)

Comments

yyq commented May 30, 2023

I'm trying the demo code with CUDA 12.1.

Running !python -c "import torch;print(torch.cuda.nccl.version())" returns (2, 14, 3).

Below is the original import error stack:


ImportError Traceback (most recent call last)
Cell In[10], line 1
----> 1 from cpm_live.generation.bee import CPMBeeBeamSearch
2 from cpm_live.models import CPMBeeTorch, CPMBeeConfig
3 from cpm_live.tokenizers import CPMBeeTokenizer

File /workspace/cpm_live/generation/__init__.py:1
----> 1 from .ant import CPMAntBeamSearch, CPMAntRandomSampling, CPMAntGeneration

File /workspace/cpm_live/generation/ant.py:4
2 import torch.nn.functional as F
3 from .generation_utils import BeamHypotheses, apply_repetition_penalty, top_k_top_p_filtering
----> 4 from ..utils import pad
7 class CPMAntGeneration:
8 def __init__(self, model, tokenizer, prompt_length=32):

File /workspace/cpm_live/utils/__init__.py:1
----> 1 from .config import Config
2 from .data_utils import pad
3 from .object import allgather_objects

File /workspace/cpm_live/utils/config.py:20
18 import copy
19 from typing import Any, Dict, Union
---> 20 from .log import logger
23 def load_dataset_config(dataset_path: str):
24 cfg = json.load(open(dataset_path, "r", encoding="utf-8"))

File /workspace/cpm_live/utils/log.py:7
5 import json
6 import logging
----> 7 import bmtrain as bmt
10 # Set up the common logger
11 def _get_logger():

File /usr/local/lib/python3.10/dist-packages/bmtrain/__init__.py:2
1 from .global_var import config, world_size, rank
----> 2 from .init import init_distributed
4 from .parameter import DistributedParameter, ParameterInitializer
5 from .layer import DistributedModule

File /usr/local/lib/python3.10/dist-packages/bmtrain/init.py:8
6 from .utils import print_dict
7 from .global_var import config
----> 8 from . import nccl
9 from .synchronize import synchronize
10 def init_distributed(
11 init_method : str = "env://",
12 seed : int = 0,
(...)
15 num_micro_batches: int = None,
16 ):

File /usr/local/lib/python3.10/dist-packages/bmtrain/nccl/__init__.py:4
2 from typing_extensions import Literal
3 import torch
----> 4 from . import _C as C
5 from .enums import *
7 class NCCLCommunicator:

ImportError: /usr/local/lib/python3.10/dist-packages/bmtrain/nccl/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: ncclBroadcast

@MayDomine

To check whether the CUDA version used to compile your Torch C++ plugin matches the version of your installed CUDA Toolkit, start with the following Python commands:

import torch
print(torch.version.cuda)

This command will print the CUDA version that was used to compile PyTorch. Please ensure that this version matches the version of your installed CUDA Toolkit.

In addition, please note that PyTorch 2.0.0 and above is not yet supported. Make sure that your installed PyTorch version is below 2.0.0. You can check the PyTorch version with the following Python commands:

import torch
print(torch.__version__)

If your PyTorch version is not compatible, please downgrade PyTorch to a compatible version using pip or conda, depending on how you initially installed PyTorch.
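
The version check described above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the 2.0.0 cutoff is taken from the comment above rather than from any bmtrain API.

```python
import re
from typing import Tuple


def parse_torch_version(version: str) -> Tuple[Tuple[int, ...], str]:
    """Split a string such as '1.13.1+cu117' into ((1, 13, 1), 'cu117')."""
    release, _, local = version.partition("+")
    # Keep only the first three numeric components (major, minor, patch).
    numbers = tuple(int(n) for n in re.findall(r"\d+", release)[:3])
    return numbers, local


def is_supported_torch(version: str) -> bool:
    """Return True if the version is below 2.0.0 (cutoff stated in this thread)."""
    numbers, _ = parse_torch_version(version)
    return numbers < (2, 0, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this interpreter")
    else:
        print("torch version:", torch.__version__)
        print("compiled against CUDA:", torch.version.cuda)
        print("below 2.0.0:", is_supported_torch(torch.__version__))
```

The helpers work on plain strings, so they can be checked without a GPU or even without torch installed.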

yyq commented May 30, 2023

I tried downgrading to torch.__version__=1.13.1+cu117 (torch.version.cuda=11.7), but I still get the same error.

@MayDomine

torch.version.cuda=11.7 and torch.__version__=1.13.1+cu117 only tell you that the CUDA version used to compile torch is 11.7. You also need to make sure that the installed CUDA Toolkit version matches the version used to compile torch.
You can use nvcc --version to check the CUDA Toolkit version. Note that nvidia-smi reports the highest CUDA version the driver supports, which is not necessarily the installed Toolkit version.
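
As a rough sketch of that comparison, the release number can be pulled out of the nvcc --version text and compared with torch.version.cuda. The function names are hypothetical, and the regular expression assumes the usual "release X.Y" wording in nvcc's output.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'X.Y' from a line like 'Cuda compilation tools, release 11.7, V11.7.64'."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_matches_torch(nvcc_output: str, torch_cuda: str) -> bool:
    """Compare the Toolkit release with the string from torch.version.cuda."""
    return parse_nvcc_release(nvcc_output) == torch_cuda


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc was not found on PATH")
    else:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        )
        print("CUDA Toolkit release:", parse_nvcc_release(result.stdout))
```

A match here only shows that the nvcc found on PATH agrees with torch; it does not rule out a second Toolkit or NCCL installation elsewhere on the machine.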

@MathamPollard

cuda version: 11.3
torch version: 1.12.1
print(torch.version.cuda): 11.3
print(torch.cuda.is_available()): True
!python -c "import torch;print(torch.cuda.nccl.version())" returns (2, 10, 3)

Still the same error.
#26

MayDomine commented May 31, 2023

Please make sure that you have tried pip install bmtrain --no-cache-dir.
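
A minimal sketch of building that command from Python, assuming the aim is to run pip from the same interpreter that later imports bmtrain (the tracebacks and logs in this thread show more than one Python installation). The helper name is hypothetical; the only flag used is the one from the comment above.

```python
import sys
from typing import List


def pip_install_no_cache(package: str) -> List[str]:
    """Build a pip command bound to the current interpreter, bypassing pip's cache."""
    return [sys.executable, "-m", "pip", "install", package, "--no-cache-dir"]


if __name__ == "__main__":
    # Only print the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(pip_install_no_cache("bmtrain")))
```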

zh-zheng added the Environment label (issues related to system environment) on May 31, 2023

@diaojunxian

@MayDomine Hi, I get the same error in my server environment.

torch == 1.13.1+cu117

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

LLMChild commented Jun 8, 2023

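I have tested this environment and it does not produce the error. Please check the CUDA runtime path, whether pip used its cache during installation, and whether there are conflicting NCCL versions installed locally, among other things.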
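
One way to look for conflicting local NCCL versions is to collect the version suffixes of every libnccl.so file found on the machine. The sketch below assumes the file paths have already been gathered (for example with locate or find); the function names and the example paths are hypothetical.

```python
import re
from typing import Iterable, Set


def nccl_versions_from_paths(paths: Iterable[str]) -> Set[str]:
    """Collect the suffix after 'libnccl.so.' from each path, e.g. '2' or '2.14.3'."""
    versions = set()
    for path in paths:
        match = re.search(r"libnccl\.so\.(\d+(?:\.\d+)*)$", path)
        if match:
            versions.add(match.group(1))
    return versions


def has_nccl_conflict(paths: Iterable[str]) -> bool:
    """Report a possible conflict when more than one full x.y.z version is present."""
    # A bare major suffix such as '2' is usually a symlink, so only x.y.z counts.
    full = {v for v in nccl_versions_from_paths(paths) if v.count(".") == 2}
    return len(full) > 1


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example = [
        "/usr/lib/x86_64-linux-gnu/libnccl.so.2",
        "/usr/lib/x86_64-linux-gnu/libnccl.so.2.14.3",
        "/opt/other/lib/libnccl.so.2.10.3",
    ]
    print(sorted(nccl_versions_from_paths(example)))
    print("possible conflict:", has_nccl_conflict(example))
```

Two different full versions do not prove that the wrong library is loaded, but they are a reasonable place to start looking.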

diaojunxian commented Jun 8, 2023

python -c "import torch;print(torch.cuda.nccl.version())"
Output: (2, 14, 3)

locate nccl| grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
Output: 2

When I train with transformers, I see:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/.conda/envs/3.9/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

@Fword4u Hi, this is what I found when checking my environment. I really cannot see where the environment configuration conflicts.

@diaojunxian

After running pip install bmtrain --no-cache-dir, the error no longer occurs. I would like to know why.

6 participants