
[BUG] ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE #3410

Closed
janglichao opened this issue Apr 29, 2023 · 7 comments
Labels: bug, compression

@janglichao

Describe the bug
Running step 2 with this script:

deepspeed DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py \
   --data_split 2,4,4 \
   --model_name_or_path facebook/opt-350m \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --gradient_checkpointing \
   --seed 1234 \
   --zero_stage 0 \
   --deepspeed \
   --output_dir /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output \
   &> /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output/rm_training.log

Then I got these errors:

CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.

ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol:
_ZN3c104cuda20CUDACachingAllocator9allocatorE
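
(For context: the CalledProcessError means no C++ compiler was found on PATH, so the JIT build of fused_adam could not run; the undefined symbol usually means a previously compiled fused_adam.so no longer matches the installed torch. A minimal recovery sketch for the compiler part, assuming a Debian/Ubuntu image — the package names are an assumption, not from the original report:)

# Install a C++ toolchain and ninja so DeepSpeed can JIT-compile its ops
# (assumes apt; adjust for your distro)
apt-get update && apt-get install -y build-essential ninja-build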

@emilankerwiik

@janglichao having the same issue :)

usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE

Did you find a solution?

@janglichao
Author

> @janglichao having the same issue :)
>
> usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
>
> Did you find a solution?

Try running "ds_report"; you may see that some ops are not installed on your system. The fused_adam op should be installed.
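
A sketch of pre-building the op instead of relying on JIT compilation: DS_BUILD_FUSED_ADAM is DeepSpeed's documented pre-build flag, but the reinstall command itself is an assumption about your environment.

# Reinstall DeepSpeed with fused_adam compiled ahead of time instead of at runtime
DS_BUILD_FUSED_ADAM=1 pip install deepspeed --force-reinstall --no-cache-dir
# Then re-run ds_report: fused_adam should now show as installed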

@niuhuluzhihao

@emilankerwiik @janglichao I have the same issue. Did you find a solution?
My ds_report output is:

(myenv) algo@algogpu:~/mzh/vicuna_0605/scripts$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/algo/.local/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/algo/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3+unknown, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
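
(Two things stand out in this report, as my reading rather than part of the original comment: torch 1.13.1 was built for CUDA 11.7 while nvcc is 11.8, and the missing symbol demangles to c10::cuda::CUDACachingAllocator::allocator, which typically means the cached .so was compiled against a different torch than the one installed. A diagnostic sketch; the libc10_cuda.so path is inferred from the torch install path above and may differ on your machine:)

# What torch/CUDA does the runtime actually see?
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Does the installed libtorch export the symbol the cached .so needs?
nm -D /home/algo/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so | grep CUDACachingAllocator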

@emilankerwiik

emilankerwiik commented Jun 29, 2023

@summer-silence

If I am not mixing up my dependency issues, I solved it with this. Best of luck!

!pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117

# Solution to package problems and compile errors with different CUDA versions: updated with latest torch==2.0.0-compatible package versions. From AUTOMATIC1111/stable-diffusion-webui#9341 and AUTOMATIC1111/stable-diffusion-webui#4629
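
A quick sanity check after that install (a sketch; the expected versions are just the ones pinned above):

# Verify the matched torch/torchvision pair actually got installed
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
# Expect 2.0.0+cu117 and 0.15.1+cu117; a mismatched pair here tends to reproduce the undefined-symbol error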

@Unicorncosmos

@emilankerwiik hi

> @summer-silence
>
> If I am not mixing up my dependency issues, I solved it with this. Best of luck!
>
> !pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117
>
> # Solution to package problems and compile errors with different CUDA versions: updated with latest torch==2.0.0-compatible package versions. From AUTOMATIC1111/stable-diffusion-webui#9341 and AUTOMATIC1111/stable-diffusion-webui#4629

What Python version is used here?

@Justinfungi

Clear the cache with rm -rf ~/.cache so the cached extension files are reset and rebuilt. I think this happens because the source files the cached .so was linked against were accidentally removed.
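
A narrower variant of that, assuming only the DeepSpeed extension cache is stale (the path comes from the original error message):

# Remove only the stale JIT-compiled DeepSpeed ops rather than all of ~/.cache
rm -rf ~/.cache/torch_extensions
# The next deepspeed run will recompile fused_adam against the installed torch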

@zhzihao

zhzihao commented May 17, 2024

I solved it with this: change transformers to 4.37.2 and flash-attn to 2.4.2.
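
As a concrete command (a sketch of that pin; whether these exact versions fit depends on the rest of your stack):

pip install transformers==4.37.2 flash-attn==2.4.2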
