
[BUG] ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE #3410

Closed
janglichao opened this issue Apr 29, 2023 · 7 comments
Labels: bug, compression

@janglichao

Describe the bug
Running step 2 with this script:

deepspeed DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py \
   --data_split 2,4,4 \
   --model_name_or_path facebook/opt-350m \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --gradient_checkpointing \
   --seed 1234 \
   --zero_stage 0 \
   --deepspeed \
   --output_dir /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output \
   &> /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output/rm_training.log

Then I got these errors:

CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.

ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol:
_ZN3c104cuda20CUDACachingAllocator9allocatorE
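
(For context: the CalledProcessError means no C++ compiler was found on PATH, so the JIT build of fused_adam could not run; the undefined symbol usually means a previously compiled fused_adam.so no longer matches the installed torch. A minimal recovery sketch for the compiler part, assuming a Debian/Ubuntu image — the package names are an assumption, not from the original report:)

# Install a C++ toolchain and ninja so DeepSpeed can JIT-compile its ops
# (assumes apt; adjust for your distro)
apt-get update && apt-get install -y build-essential ninja-build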

@emilankerwiik

@janglichao having the same issue :)

usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE

Did you find a solution?

@janglichao
Author

> @janglichao having the same issue :)
>
> usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
>
> Did you find a solution?

Try running "ds_report"; you may see that some ops are not installed on your system. The fused_adam op should be installed.
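
A sketch of pre-building the op instead of relying on JIT compilation: DS_BUILD_FUSED_ADAM is DeepSpeed's documented pre-build flag, but the reinstall command itself is an assumption about your environment.

# Reinstall DeepSpeed with fused_adam compiled ahead of time instead of at runtime
DS_BUILD_FUSED_ADAM=1 pip install deepspeed --force-reinstall --no-cache-dir
# Then re-run ds_report: fused_adam should now show as installed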

@niuhuluzhihao

@emilankerwiik @janglichao I have the same issue. Did you find a solution?
My ds_report output is:

(myenv) algo@algogpu:~/mzh/vicuna_0605/scripts$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/algo/.local/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/algo/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3+unknown, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
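
(Two things stand out in this report, as my reading rather than part of the original comment: torch 1.13.1 was built for CUDA 11.7 while nvcc is 11.8, and the missing symbol demangles to c10::cuda::CUDACachingAllocator::allocator, which typically means the cached .so was compiled against a different torch than the one installed. A diagnostic sketch; the libc10_cuda.so path is inferred from the torch install path above and may differ on your machine:)

# What torch/CUDA does the runtime actually see?
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Does the installed libtorch export the symbol the cached .so needs?
nm -D /home/algo/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so | grep CUDACachingAllocator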

@emilankerwiik

emilankerwiik commented Jun 29, 2023

@summer-silence

If I am not mixing up my dependency issues, I solved it with this. Best of luck!

!pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117

# Solution to package problems and compile errors with different CUDA versions: updated with latest torch==2.0.0-compatible package versions. From AUTOMATIC1111/stable-diffusion-webui#9341 and AUTOMATIC1111/stable-diffusion-webui#4629
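
A quick sanity check after that install (a sketch; the expected versions are just the ones pinned above):

# Verify the matched torch/torchvision pair actually got installed
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
# Expect 2.0.0+cu117 and 0.15.1+cu117; a mismatched pair here tends to reproduce the undefined-symbol error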

@Unicorncosmos

@emilankerwiik hi

> @summer-silence
>
> If I am not mixing up my dependency issues, I solved it with this. Best of luck!
>
> !pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117
>
> # Solution to package problems and compile errors with different CUDA versions: updated with latest torch==2.0.0-compatible package versions. From AUTOMATIC1111/stable-diffusion-webui#9341 and AUTOMATIC1111/stable-diffusion-webui#4629

What Python version is used here?

@Justinfungi

Clear the cache with rm -rf ~/.cache so the cached extension files are reset and rebuilt. I think this happens because the source files the cached .so was linked against were accidentally removed.
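
A narrower variant of that, assuming only the DeepSpeed extension cache is stale (the path comes from the original error message):

# Remove only the stale JIT-compiled DeepSpeed ops rather than all of ~/.cache
rm -rf ~/.cache/torch_extensions
# The next deepspeed run will recompile fused_adam against the installed torch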

@zhzihao

zhzihao commented May 17, 2024

I solved it with this: change transformers to 4.37.2 and flash-attn to 2.4.2.
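
As a concrete command (a sketch of that pin; whether these exact versions fit depends on the rest of your stack):

pip install transformers==4.37.2 flash-attn==2.4.2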
