I was trying to run Megatron with a ZeRO stage 2 config when I encountered this error. The code version is Megatron-LM-v1.1.5-3D_parallelism.
Traceback (most recent call last):
File "pretrain_gpt2.py", line 158, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 100, in pretrain
train_data_iterator, valid_data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 485, in train
lr_scheduler)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 325, in train_step
return train_step_pipe(model, data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 359, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 283, in train_batch
self._exec_schedule(sched)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 1161, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 219, in _exec_reduce_tied_grads
self.module.allreduce_tied_weight_gradients()
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
dist.all_reduce(weight.grad, group=comm['group'])
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 890, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_single_tensor
"to be of type torch.Tensor.".format(param_name))
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.
It seems that at deepspeed/runtime/pipe/module.py, line 409, in allreduce_tied_weight_gradients:
dist.all_reduce(weight.grad, group=comm['group'])
weight.grad is not a Tensor (presumably it is None). This error doesn't occur with ZeRO stage 0 and stage 1 configs.
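To illustrate the failure mode without DeepSpeed itself: under ZeRO stage 2, gradient partitioning can leave a parameter's .grad set to None, and torch.distributed.all_reduce rejects anything that is not a torch.Tensor. A minimal stand-in sketch (plain Python; the class and helper names here are hypothetical, not DeepSpeed's actual code):

```python
class Param:
    """Stand-in for a torch.nn.Parameter whose .grad ZeRO-2 has cleared."""
    def __init__(self, grad=None):
        self.grad = grad

def all_reduce(tensor, group=None):
    # Mimics torch.distributed.all_reduce's argument check, which raises
    # the RuntimeError seen in the traceback when tensor is not a Tensor.
    if tensor is None:
        raise RuntimeError(
            "Invalid function argument. Expected parameter `tensor` "
            "to be of type torch.Tensor."
        )
    return tensor  # the real all_reduce would communicate across ranks here

def allreduce_tied_grad(weight, group=None):
    # One possible guard: skip parameters whose gradient has been
    # partitioned away (grad is None under ZeRO stage 2).
    if weight.grad is None:
        return
    all_reduce(weight.grad, group=group)
```

With such a guard, ranks holding no gradient shard for a tied weight would simply skip the collective instead of crashing, though whether skipping is semantically correct here depends on how ZeRO-2 partitions the tied gradients.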
My script is like this:
The zero 2 config is like this:
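(The reporter's actual script and config were not captured in the issue text. For reference, a minimal illustrative DeepSpeed ZeRO stage 2 config would look like the following; all values are placeholders, not the reporter's settings:)

```json
{
  "train_batch_size": 8,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}
```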
@ShadenSmith @jeffra