[BUG] ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE #3410
Comments
@janglichao Having the same issue :) /usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE'. Did you find a solution?
Try running "ds_report"; it may show that some ops are not installed on your system. The fused_adam op should be installed.
@emilankerwiik @janglichao I have the same question. Did you find a solution?
@summer-silence If I am not mixing up my dependency issues, I solved it with this. Best of luck!

!pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117

# Solution to package problems and compile errors with different CUDA versions: updated with torch==2.0.0-compatible package versions. From AUTOMATIC1111/stable-diffusion-webui#9341 and AUTOMATIC1111/stable-diffusion-webui#4629
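The key point of the pin list above is that every torch-family wheel shares the same CUDA build tag (+cu117); mixing tags is a common cause of undefined-symbol errors like this one. A small sanity check one could run over such a pin list (the `cuda_tag` helper is hypothetical, not part of any package):

```python
# Hypothetical helper: verify that pinned torch-family versions
# all carry the same CUDA local-version tag (e.g. "+cu117").
def cuda_tag(version: str) -> str:
    """Return the local-version tag after '+', or '' if none."""
    return version.split("+", 1)[1] if "+" in version else ""

pins = {
    "torch": "2.0.0+cu117",
    "torchvision": "0.15.1+cu117",
    "torchaudio": "2.0.1",  # some wheels carry no explicit tag
}
tags = {cuda_tag(v) for v in pins.values() if cuda_tag(v)}
# More than one distinct tag means mismatched CUDA builds.
assert len(tags) == 1, f"mixed CUDA tags: {tags}"
print(tags)  # → {'cu117'}
```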
What Python version is used here?
Clear the ./cache by
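The cache-clear suggested above (the comment is truncated) is commonly done like this, using the cache path shown in the traceback; a sketch, not the original commenter's exact command:

```shell
# Remove the stale JIT-built DeepSpeed extensions so they get
# recompiled against the currently installed torch on the next run.
rm -rf ~/.cache/torch_extensions
```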
I solved it by changing transformers to 4.37.2 and flash-attn to 2.4.2.
Describe the bug
Running step 2 with this script:
deepspeed DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py \
   --data_split 2,4,4 \
   --model_name_or_path facebook/opt-350m \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --gradient_checkpointing \
   --seed 1234 \
   --zero_stage 0 \
   --deepspeed \
   --output_dir /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output \
   &> /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output/rm_training.log
then got these errors:
CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol:
_ZN3c104cuda20CUDACachingAllocator9allocatorE
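For context, the undefined symbol demangles to c10::cuda::CUDACachingAllocator::allocator, a symbol from torch's c10 CUDA library; the cached fused_adam.so references it, which suggests the extension was JIT-built against a different torch build than the one currently installed (the CalledProcessError additionally indicates no c++ compiler is on PATH to rebuild it). A minimal sketch of decoding such an Itanium-mangled nested name (illustrative helper only, not part of DeepSpeed or torch):

```python
def demangle_nested(sym: str) -> str:
    # Minimal parser for Itanium ABI nested names of the form
    # _ZN <len><segment> ... E -- just enough to read this error.
    assert sym.startswith("_ZN") and sym.endswith("E")
    body, parts, i = sym[3:], [], 0
    while body[i] != "E":
        j = i
        while body[j].isdigit():   # read the segment-length prefix
            j += 1
        n = int(body[i:j])
        parts.append(body[j:j + n])  # read exactly n name characters
        i = j + n
    return "::".join(parts)

print(demangle_nested("_ZN3c104cuda20CUDACachingAllocator9allocatorE"))
# → c10::cuda::CUDACachingAllocator::allocator
```

(`c++filt` does the same from the command line; the parser above only handles this simple nested-name shape, not the full mangling grammar.)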