
RuntimeError: Invalid function argument. Expected parameter tensor to be of type #88

Closed
FannierPeng opened this issue Feb 19, 2021 · 3 comments


@FannierPeng

I was trying to run Megatron with a ZeRO stage 2 config when I encountered this error.
The code version is Megatron-LM-v1.1.5-3D_parallelism.

Traceback (most recent call last):
File "pretrain_gpt2.py", line 158, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 100, in pretrain
train_data_iterator, valid_data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 485, in train
lr_scheduler)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 325, in train_step
return train_step_pipe(model, data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 359, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 283, in train_batch
self._exec_schedule(sched)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 1161, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 219, in _exec_reduce_tied_grads
self.module.allreduce_tied_weight_gradients()
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
dist.all_reduce(weight.grad, group=comm['group'])
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 890, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_single_tensor
"to be of type torch.Tensor.".format(param_name))
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.

It seems that at /opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py, line 409, in allreduce_tied_weight_gradients,
dist.all_reduce(weight.grad, group=comm['group'])
fails because weight.grad is not a Tensor. This error does not occur with the ZeRO stage 0 and stage 1 configs.
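
For what it's worth, torch.distributed.all_reduce requires a torch.Tensor, and under ZeRO stage 2 the optimizer partitions the gradients and can clear a parameter's .grad before the pipeline engine's tied-weight hook runs, leaving it as None. A minimal sketch of a defensive guard (the function name and the tied_weights/group arguments are illustrative assumptions, not the upstream fix):

import torch.distributed as dist

def allreduce_tied_weight_gradients_guarded(tied_weights, group):
    # Sketch only: all-reduce the gradients of tied parameters, skipping any
    # parameter whose .grad is None (as can happen when the ZeRO stage 2
    # optimizer has already partitioned or cleared the gradient buffer).
    for weight in tied_weights:
        if weight.grad is None:
            continue  # no local gradient to reduce for this parameter
        dist.all_reduce(weight.grad, group=group)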

My script is like this:

#! /bin/bash

GPUS_PER_NODE=16
# Change for multinode config
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6000
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

export DLWS_NUM_WORKER=${NNODES}
export DLWS_NUM_GPU_PER_WORKER=${GPUS_PER_NODE}

DATA_PATH=/userhome/ChineseCorpus/Megatron-training/all-sample100G-samplebyfile-combine10M/text_document
VOCAB_PATH=bpe_3w_new/vocab.json
MERGE_PATH=bpe_3w_new/merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m_ds

script_path=$(realpath $0)
script_dir=$(dirname $script_path)
config_json="$script_dir/ds_zero_stage_2_config.json"
# config_json="$script_dir/ds_config.json"

# Megatron Model Parallelism
mp_size=2
# DeepSpeed Pipeline parallelism
pp_size=2

NLAYERS=24
NHIDDEN=1024
BATCHSIZE=4
LOGDIR="tensorboard_data/${NLAYERS}l_${NHIDDEN}h_${NNODES}n_${GPUS_PER_NODE}g_${pp_size}pp_${mp_size}mp_${BATCHSIZE}b_ds4"

GAS=16

#ZeRO Configs
stage=2
reduce_scatter=true
contigious_gradients=true
rbs=50000000
agbs=5000000000

#Activation Checkpointing and Contiguous Memory
chkp_layers=1
PA=true
PA_CPU=false
CC=true
SYNCHRONIZE=true
PROFILE=false


gpt_options=" \
        --model-parallel-size ${mp_size} \
        --pipe-parallel-size ${pp_size} \
        --num-layers $NLAYERS \
        --hidden-size $NHIDDEN \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --batch-size $BATCHSIZE \
        --gas $GAS \
        --train-iters 320000 \
        --lr-decay-iters 320000 \
        --save $CHECKPOINT_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --vocab-file $VOCAB_PATH \
        --merge-file $MERGE_PATH \
        --data-impl mmap \
        --split 949,50,1 \
        --distributed-backend nccl \
        --lr 1.5e-4 \
        --lr-decay-style cosine \
        --min-lr 1.0e-5 \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --warmup 0.01 \
        --checkpoint-activations \
        --log-interval 1 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --fp16 \
        --tensorboard-dir ${LOGDIR}
"
  
 deepspeed_options=" \
                --deepspeed \
                --deepspeed_config ${config_json} \
                --zero-stage ${stage} \
                --zero-reduce-bucket-size ${rbs} \
                --zero-allgather-bucket-size ${agbs} 
            "

if [ "${contigious_gradients}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
                --zero-contigious-gradients"
fi

if [ "${reduce_scatter}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
                --zero-reduce-scatter"
fi

chkp_opt=" \
--checkpoint-activations \
--checkpoint-num-layers ${chkp_layers}"

if [ "${PA}" = "true" ]; then
chkp_opt="${chkp_opt} \
        --partition-activations"
fi

if [ "${PA_CPU}" = "true" ]; then
chkp_opt="${chkp_opt} \
        --checkpoint-in-cpu"
fi

if [ "${SYNCHRONIZE}" = "true" ]; then
chkp_opt="${chkp_opt} \
        --synchronize-each-layer"
fi

if [ "${CC}" = "true" ]; then
chkp_opt="${chkp_opt} \
        --contigious-checkpointing"
fi

if [ "${PROFILE}" = "true" ]; then
chkp_opt="${chkp_opt} \
        --profile-backward"
fi

full_options="${gpt_options} ${deepspeed_options} ${chkp_opt}"

run_cmd="deepspeed --num_nodes ${DLWS_NUM_WORKER} --num_gpus ${DLWS_NUM_GPU_PER_WORKER} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} pretrain_gpt2.py $@ ${full_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x

The ZeRO stage 2 config is like this:

{
  "train_batch_size":256,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 50000000,
    "reduce_bucket_size": 50000000,
    "overlap_comm": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0,
      "betas": [0.9, 0.95]
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,

    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": false
}
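
As a plain-arithmetic sanity check of the batch sizes above (illustrative variable names only, not DeepSpeed API calls): with 16 GPUs, model-parallel size 2, and pipeline-parallel size 2, the data-parallel degree is 4, and a micro-batch of 4 with 16 gradient-accumulation steps gives a global batch of 256, matching train_batch_size in this config.

gpus = 16                          # GPUS_PER_NODE * NNODES
mp, pp = 2, 2                      # --model-parallel-size, --pipe-parallel-size
dp = gpus // (mp * pp)             # data-parallel degree = 4
micro_batch = 4                    # --batch-size
gas = 16                           # --gas (gradient accumulation steps)
assert micro_batch * gas * dp == 256   # matches "train_batch_size" above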

@ShadenSmith @jeffra

@melissamao

Hello! I am also facing the same problem and I wonder how to solve it. Thanks!

@FannierPeng
Author

FannierPeng commented Mar 1, 2024 via email

@tjruwase
Contributor

tjruwase commented Mar 1, 2024

@melissamao and @FannierPeng, that codebase is deprecated.

Please use https://github.com/microsoft/Megatron-DeepSpeed as recommended.

@tjruwase closed this as completed Mar 1, 2024