Megatron-LM pretrain_bert with deepspeed #179

Closed
sj6077 opened this issue Mar 31, 2020 · 0 comments

sj6077 commented Mar 31, 2020

I got the error below after applying the steps from the GPT2 tutorial to the BERT code (pretrain_bert.py).
Can you tell me what I missed?
The error message is "RuntimeError: expected scalar type Float but found Half (data at /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1821)", and the trace is below.

Traceback (most recent call last):
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 617, in <module>
    main()
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 595, in main
    timers, args)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 354, in train
    args, timers)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 304, in train_step
    args, timers)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 232, in forward_step
    checkpoint_activations=args.checkpoint_activations)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/deepspeed_light.py", line 613, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/distributed.py", line 78, in forward
    return self.module(*inputs, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 65, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/model.py", line 82, in forward
    checkpoint_activations=checkpoint_activations)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/modeling.py", line 944, in forward
    output_all_encoded_layers=False, checkpoint_activations=checkpoint_activations)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/modeling.py", line 869, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/modeling.py", line 300, in forward
    embeddings = self.LayerNorm(embeddings)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 159, in forward
    input, self.weight, self.bias, self.normalized_shape,self.eps)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 25, in forward
    input_, ctx.normalized_shape, weight_, bias_, ctx.eps)
RuntimeError: expected scalar type Float but found Half (data at /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1821)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fcc922f5273 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: float* at::Tensor::data<float>() const + 0x449 (0x7fc8843aa5e9 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: cuda_layer_norm(at::Tensor*, at::Tensor*, at::Tensor*, at::Tensor*, int, int, c10::ArrayRef<long>, at::Tensor*, at::Tensor*, double) + 0x725 (0x7fc8843a76c5 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: layer_norm_affine(at::Tensor, c10::ArrayRef<long>, at::Tensor, at::Tensor, double) + 0x2a4 (0x7fc884394ca4 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1e254 (0x7fc8843a5254 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x1a8e0 (0x7fc8843a18e0 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
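
For anyone hitting the same message: it says that a tensor passed into apex's fused layer norm was Half where the kernel expected Float, i.e. the tensors reaching that call were of mixed precision. This thread does not record the root cause, so what follows is only a diagnostic sketch and not a confirmed fix. `find_dtype_mismatches` is a hypothetical helper (it is not part of DeepSpeed, Megatron-LM, or apex), and the parameter names and dtypes in the sample data are made up for illustration.

```python
from collections import namedtuple


# Hypothetical helper, not part of DeepSpeed, Megatron-LM, or apex.
def find_dtype_mismatches(named_tensors, expected_dtype):
    """Return (name, dtype) pairs whose dtype differs from expected_dtype.

    named_tensors is any iterable of (name, obj) pairs where obj has a
    .dtype attribute, for example model.named_parameters() in PyTorch.
    """
    return [(name, t.dtype) for name, t in named_tensors
            if t.dtype != expected_dtype]


# Stand-in for real tensors so the sketch runs without PyTorch or a GPU.
FakeTensor = namedtuple("FakeTensor", ["dtype"])

# Illustrative data only: these names and dtypes are assumptions,
# not values taken from the failing run.
params = [
    ("embeddings.word_embeddings.weight", FakeTensor("float16")),
    ("embeddings.LayerNorm.weight", FakeTensor("float32")),
    ("embeddings.LayerNorm.bias", FakeTensor("float32")),
]

mismatches = find_dtype_mismatches(params, "float16")
for name, dtype in mismatches:
    print(f"{name}: {dtype}")
```

On a real model the same check would be `find_dtype_mismatches(model.named_parameters(), torch.float16)`; any entries it returns are candidates for the tensor the kernel is complaining about.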

@sj6077 sj6077 closed this as completed Mar 31, 2020
jeffra added a commit to jeffra/DeepSpeed that referenced this issue May 10, 2021
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>