When attempting to run a multi-node, multi-GPU training job using DeepSpeed, I encounter a series of compilation and import errors immediately after importing the model. The issues seem to stem from compiler compatibility, build failures, and a missing shared object file.
Error Messages:
[1/3] nvcc warning : The -std=c++17 flag is not supported with the configured host compiler. Flag will be ignored.
In file included from /mnt/petrelfs/suzhaochen/anaconda3/envs/sft_new/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:11:0:
/mnt/petrelfs/suzhaochen/anaconda3/envs/sft_new/lib/python3.10/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++14 or later compatible compiler is required to use ATen.
#error C++14 or later compatible compiler is required to use ATen.
[2/3] ninja: build stopped: subcommand failed.
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
[3/3] ImportError: /mnt/petrelfs/suzhaochen/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
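I think this is the main error: "error: #error C++14 or later compatible compiler is required to use ATen." The nvcc warning just before it says the -std=c++17 flag is being ignored because the configured host compiler does not support it, so the language-standard check in ATen.h fails. The ninja failure and the missing fused_adam.so both follow from that failed build. What compiler do you have? Can you update to a compiler that supports C++17?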
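To answer the compiler question quickly, here is a minimal sketch that reports the major version of the host compiler found on PATH. It assumes the host compiler is g++ and uses GCC 7 as a rough floor for C++17 support. Both are assumptions for illustration, not values taken from the DeepSpeed or PyTorch documentation. The helper names are hypothetical. Which compiler nvcc actually picks up can also depend on the environment of the Slurm job, so treat this as a first check only and run it on a compute node.

```python
import shutil
import subprocess

# Assumption: GCC 7 is used here as a rough floor for C++17 support.
ASSUMED_MIN_MAJOR = 7


def parse_major(version_text):
    """Return the major version from output such as '4.8.5' or '9'."""
    return int(version_text.strip().split(".")[0])


def host_compiler_major(compiler="g++"):
    """Return the major version of `compiler`, or None if it cannot be determined."""
    path = shutil.which(compiler)
    if path is None:
        return None
    try:
        # -dumpversion prints only the version string, e.g. "4.8.5" or "9".
        result = subprocess.run(
            [path, "-dumpversion"], capture_output=True, text=True, check=True
        )
        return parse_major(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        return None


def meets_assumed_floor(major, min_major=ASSUMED_MIN_MAJOR):
    """True if the major version is at least the assumed C++17 floor."""
    return major is not None and major >= min_major


if __name__ == "__main__":
    major = host_compiler_major()
    if major is None:
        print("g++ not found on PATH, or its version could not be read")
    elif meets_assumed_floor(major):
        print(f"g++ major version {major}: at or above the assumed floor")
    else:
        print(f"g++ major version {major}: likely too old for -std=c++17")
```

If the reported version is below the floor, load or install a newer GCC in the job environment before retrying, so that the extension is rebuilt with that compiler.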
Slurm training scripts
Environment