import error, undefined symbol: ncclBroadcast #18

yyq · 2023-05-30T03:07:41Z

I'm trying the demo code with CUDA 12.1. Here are the details:

Running !python -c "import torch;print(torch.cuda.nccl.version())" returns (2, 14, 3).

below is the original import error stack:

ImportError Traceback (most recent call last)
Cell In[10], line 1
----> 1 from cpm_live.generation.bee import CPMBeeBeamSearch
2 from cpm_live.models import CPMBeeTorch, CPMBeeConfig
3 from cpm_live.tokenizers import CPMBeeTokenizer

File /workspace/cpm_live/generation/__init__.py:1
----> 1 from .ant import CPMAntBeamSearch, CPMAntRandomSampling, CPMAntGeneration

File /workspace/cpm_live/generation/ant.py:4
2 import torch.nn.functional as F
3 from .generation_utils import BeamHypotheses, apply_repetition_penalty, top_k_top_p_filtering
----> 4 from ..utils import pad
7 class CPMAntGeneration:
8 def __init__(self, model, tokenizer, prompt_length=32):

File /workspace/cpm_live/utils/__init__.py:1
----> 1 from .config import Config
2 from .data_utils import pad
3 from .object import allgather_objects

File /workspace/cpm_live/utils/config.py:20
18 import copy
19 from typing import Any, Dict, Union
---> 20 from .log import logger
23 def load_dataset_config(dataset_path: str):
24 cfg = json.load(open(dataset_path, "r", encoding="utf-8"))

File /workspace/cpm_live/utils/log.py:7
5 import json
6 import logging
----> 7 import bmtrain as bmt
10 # Set up the common logger
11 def _get_logger():

File /usr/local/lib/python3.10/dist-packages/bmtrain/__init__.py:2
1 from .global_var import config, world_size, rank
----> 2 from .init import init_distributed
4 from .parameter import DistributedParameter, ParameterInitializer
5 from .layer import DistributedModule

File /usr/local/lib/python3.10/dist-packages/bmtrain/init.py:8
6 from .utils import print_dict
7 from .global_var import config
----> 8 from . import nccl
9 from .synchronize import synchronize
10 def init_distributed(
11 init_method : str = "env://",
12 seed : int = 0,
(...)
15 num_micro_batches: int = None,
16 ):

File /usr/local/lib/python3.10/dist-packages/bmtrain/nccl/__init__.py:4
2 from typing_extensions import Literal
3 import torch
----> 4 from . import _C as C
5 from .enums import *
7 class NCCLCommunicator:

ImportError: /usr/local/lib/python3.10/dist-packages/bmtrain/nccl/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: ncclBroadcast

MayDomine · 2023-05-30T07:26:36Z

To ensure that the CUDA version used to compile your Torch C++ plugin matches the runtime version of your current CUDA Toolkit, you can use the following Python command:

import torch
print(torch.version.cuda)

This command will print the CUDA version that was used to compile PyTorch. Please ensure that this version matches the version of your installed CUDA Toolkit.

In addition, please note that PyTorch version 2.0.0 and above are not yet supported. You should ensure that your installed version of PyTorch is less than 2.0.0. You can check the PyTorch version with the following Python command:

import torch
print(torch.__version__)

If your PyTorch version is not compatible, please downgrade PyTorch to a compatible version using pip or conda, depending on how you initially installed PyTorch.
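The two checks above can be combined into one script. This is only a minimal sketch: the helper names parse_version and is_supported_torch are hypothetical, and the 2.0.0 bound is taken from this comment rather than from any BMTrain documentation.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '1.13.1+cu117' into (1, 13, 1)."""
    # Drop a local build suffix like '+cu117' before splitting on dots.
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_supported_torch(version: str) -> bool:
    """Return True when the version is below 2.0.0 (bound assumed from this thread)."""
    return parse_version(version) < (2, 0, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print("PyTorch version:", torch.__version__)
        print("CUDA version PyTorch was built with:", torch.version.cuda)
        print("Below 2.0.0:", is_supported_torch(torch.__version__))
```

The local suffix (for example +cu117) is ignored for the comparison, since it only names the CUDA build and not the PyTorch release.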

yyq · 2023-05-30T11:29:38Z

I tried downgrading to torch.version.cuda=11.7 and torch.__version__=1.13.1+cu117, but I still get the same error.

MayDomine · 2023-05-31T04:03:07Z

torch.version.cuda=11.7 and torch.__version__=1.13.1+cu117 only mean that the CUDA version used to compile torch is 11.7. You also need to make sure that the installed CUDA Toolkit version matches the version used to compile torch.
You can use nvidia-smi or nvcc --version to check the version of the CUDA Toolkit.
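The comparison described here can be sketched in Python. The function names are hypothetical, and the sketch relies only on nvcc --version, because nvidia-smi reports the highest CUDA version the driver supports, which can differ from the installed toolkit.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract '11.7' from a line like 'Cuda compilation tools, release 11.7, V11.7.64'."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def toolkit_cuda_version() -> Optional[str]:
    """Return the CUDA Toolkit release reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


def versions_match(torch_cuda: Optional[str], toolkit_cuda: Optional[str]) -> bool:
    """Compare major.minor of the two versions; an unknown version never matches."""
    if not torch_cuda or not toolkit_cuda:
        return False
    return torch_cuda.split(".")[:2] == toolkit_cuda.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    toolkit = toolkit_cuda_version()
    print("CUDA used to build PyTorch:", torch_cuda)
    print("CUDA Toolkit (nvcc):", toolkit)
    print("Match:", versions_match(torch_cuda, toolkit))
```

If nvcc is missing from PATH, the toolkit may still be installed elsewhere (for example under /usr/local/cuda-11.7), so a None result is not proof that no toolkit exists.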

MathamPollard · 2023-05-31T08:08:16Z

cuda version:11.3
torch version: 1.12.1
print(torch.version.cuda):11.3
print(torch.cuda.is_available()): True
Running !python -c "import torch;print(torch.cuda.nccl.version())" returns (2, 10, 3).

I still get the same error. See also #26.

MayDomine · 2023-05-31T08:33:41Z

Please ensure that you have tried pip install bmtrain --no-cache-dir.
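A minimal sketch of that reinstall, driven from Python so it uses the same interpreter as the notebook. The helper names are hypothetical, and the uninstall step is an addition not mentioned in this thread. One plausible reason the flag matters (an assumption, not confirmed here) is that pip can otherwise reuse a wheel it built earlier against a different torch or CUDA setup.

```python
import subprocess
import sys
from typing import List


def reinstall_commands(package: str = "bmtrain") -> List[List[str]]:
    """Build the pip commands: uninstall first, then install while bypassing pip's cache."""
    pip = [sys.executable, "-m", "pip"]
    return [
        pip + ["uninstall", "-y", package],
        pip + ["install", package, "--no-cache-dir"],
    ]


def reinstall(package: str = "bmtrain", dry_run: bool = True) -> None:
    """Print each command, and run it only when dry_run is False."""
    for command in reinstall_commands(package):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False to actually reinstall.
    reinstall(dry_run=True)
```

Using sys.executable -m pip avoids installing into a different Python environment than the one that raises the ImportError.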

diaojunxian · 2023-06-08T09:09:51Z

@MayDomine Hi, I get the same error in my server environment.

torch == 1.13.1+cu117

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

LLMChild · 2023-06-08T09:42:36Z

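I have tested this environment and it does not produce the error. Please check the CUDA runtime path, whether the pip install used the cache, and whether the local NCCL version conflicts.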
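For the NCCL part of that check, a rough sketch using only the standard library is shown below. The helper has_symbol is hypothetical. It only inspects a system-wide NCCL that the dynamic linker can find, says nothing about an NCCL bundled inside PyTorch, and assumes a POSIX system.

```python
import ctypes
import ctypes.util
from typing import Optional


def has_symbol(library: Optional[str], symbol: str) -> bool:
    """Return True if the shared library can be loaded and exports the symbol.

    Passing None inspects the symbols already loaded into the current process.
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return False  # library missing or not loadable
    # ctypes raises AttributeError when the symbol lookup fails, so hasattr is enough.
    return hasattr(handle, symbol)


if __name__ == "__main__":
    # find_library returns a name such as 'libnccl.so.2', or None if nothing is found.
    nccl = ctypes.util.find_library("nccl")
    print("System NCCL library:", nccl)
    if nccl is not None:
        print("Exports ncclBroadcast:", has_symbol(nccl, "ncclBroadcast"))
```

A result of None for the library does not by itself explain the ImportError; it only shows that no system-wide NCCL is visible to the linker.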

diaojunxian · 2023-06-08T10:28:53Z

python -c "import torch;print(torch.cuda.nccl.version())"
Output: (2, 14, 3)

locate nccl| grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
Output: 2

When I train with transformers, I see:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/.conda/envs/3.9/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

@Fword4u Hello, this is what I found when checking my environment. I really cannot see where the environment configuration conflicts.

diaojunxian · 2023-06-09T01:48:05Z

After I ran pip install bmtrain --no-cache-dir, the error no longer occurs. I would like to know why.

zh-zheng mentioned this issue May 31, 2023

Is NCCL required for fine-tuning? #26

Closed

zh-zheng added the Environment (issues related to system environment) label May 31, 2023

benkerd22 mentioned this issue Jun 8, 2023

fix: missing nccl OpenBMB/BMTrain#111

Closed
