
Compilation Errors with DeepSpeed on Multi-GPU Training Setup #5190

Open
zhaochen0110 opened this issue Feb 26, 2024 · 1 comment
Labels: bug (Something isn't working), training

@zhaochen0110

Issue Description:

When attempting to run a multi-node, multi-GPU training job with DeepSpeed, I encounter a series of compilation and import errors immediately after the model is imported. The errors appear to stem from host-compiler incompatibility and the resulting build failure, which leaves the fused_adam shared object missing.

Error Messages:

[1/3] nvcc warning : The -std=c++17 flag is not supported with the configured host compiler. Flag will be ignored.
In file included from /mnt/petrelfs/suzhaochen/anaconda3/envs/sft_new/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:11:0:
/mnt/petrelfs/suzhaochen/anaconda3/envs/sft_new/lib/python3.10/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++14 or later compatible compiler is required to use ATen.
#error C++14 or later compatible compiler is required to use ATen.
[2/3] ninja: build stopped: subcommand failed.
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
[3/3] ImportError: /mnt/petrelfs/suzhaochen/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
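
Taken together, the nvcc warning that the -std=c++17 flag will be ignored and the ATen "C++14 or later" error point to an old host compiler (for example a system GCC 4.8.x) being picked up for the JIT build; the missing fused_adam.so is then just the downstream effect of that failed build. A few illustrative commands (not part of the original report) to see which compiler the extension build would use:

which gcc g++ nvcc
gcc --version        # DeepSpeed passes -std=c++17 here, so a reasonably new GCC (7+, ideally 9+) is needed
echo "$CC $CXX"      # torch's cpp_extension/ninja build generally respects CC/CXX if they are set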

Slurm training script:

#!/usr/bin/bash


#SBATCH --job-name=70b_180k_sft
#SBATCH --output=/mnt/petrelfs/suzhaochen/tr-sft/MAmmoTH/logs_98/%x-%j.log
#SBATCH --error=/mnt/petrelfs/suzhaochen/tr-sft/MAmmoTH/logs_98/%x-%j.log

#SBATCH --partition=MoE
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=100
#SBATCH --mem=800G

#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --quotatype=reserved

source ~/anaconda3/bin/activate sft_new



export MODEL_PATH='/mnt/petrelfs/share_data/quxiaoye/models/llama2_7B'
export OUTPUT_PATH="/mnt/petrelfs/suzhaochen/hugging-models/new_math_model/llama-70b-180k-cot"
num_nodes=2        # should match with --nodes
num_gpu_per_node=4 # should match with --gres
deepspeed_config_file=/mnt/petrelfs/suzhaochen/tr-sft/MAmmoTH/ds_config/ds_config_zero3.json


export NCCL_SOCKET_IFNAME=bond0
MASTER_ADDR=`scontrol show hostname $SLURM_JOB_NODELIST | head -n1`
MASTER_PORT=$((RANDOM % 101 + 20000))
echo $MASTER_ADDR
echo $MASTER_PORT
echo $SLURM_NODEID


srun torchrun --nnodes ${num_nodes} \
    --nproc_per_node ${num_gpu_per_node} \
    --rdzv_id $RANDOM --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $Data_path \
    --bf16 True \
    --output_dir $OUTPUT_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 10000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --model_max_length 2048 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed ${deepspeed_config_file} \
    --tf32 True
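
(Not part of the submitted script.) Because fused_adam is JIT-compiled on each node the first time it is used, one quick sanity check is to confirm that every node in the allocation resolves the same, sufficiently new host compiler, for example:

srun bash -c 'hostname; which gcc; gcc --version | head -n1'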

Environment

  • Python: 3.10.13
  • PyTorch: 2.0.1
  • CUDA: 11.7
  • cudatoolkit: 11.7.0
  • cudatoolkit-dev: 11.7.0
  • DeepSpeed: 0.9.3
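
(Not in the original report.) DeepSpeed also ships a ds_report command that prints the detected torch/CUDA/nvcc versions and the compatibility status of each op, which helps pin down setup mismatches like this one:

ds_report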
@mrwyattii (Contributor)

I think this is the main error: error: #error C++14 or later compatible compiler is required to use ATen. What compiler do you have? Can you update to a compiler that supports C++17?
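
(Not from the thread itself.) A sketch of one way to act on that suggestion in a conda-based environment like the one above; the package versions, compiler paths, and reinstall step are assumptions, not a confirmed fix:

# Install a C++17-capable toolchain into the env and point the build at it
conda install -n sft_new -c conda-forge gcc_linux-64=11 gxx_linux-64=11
export CC=$CONDA_PREFIX/bin/x86_64-conda-linux-gnu-gcc
export CXX=$CONDA_PREFIX/bin/x86_64-conda-linux-gnu-g++

# Remove the leftovers of the failed build (path taken from the ImportError above)
rm -rf /mnt/petrelfs/suzhaochen/.cache/torch_extensions/py310_cu117/fused_adam

# Optionally prebuild the op at install time instead of relying on JIT compilation
DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.9.3 --no-cache-dir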

mrwyattii self-assigned this on Feb 27, 2024