Trainer API crashes GPUs #11020

Closed
dmitriydligach opened this issue Apr 1, 2021 · 7 comments

@dmitriydligach

Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Ubuntu 20.04.2 LTS
  • Python version: Python 3.8.5
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?): 2.4.1
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

My scripts that use the Trainer API crash the GPUs on a Linux server with 4 Quadro RTX 8000 GPUs (NVIDIA-SMI 460.39, Driver Version: 460.39, CUDA Version: 11.2). To determine whether the problem is on my end, I installed the Hugging Face examples as described in https://huggingface.co/transformers/examples.html.

I then run:

python3 examples/seq2seq/run_summarization.py \
    --model_name_or_path t5-large \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir \
    --predict_with_generate

After this script runs for a few minutes (and I can see that the GPUs are being utilized when I run nvidia-smi), all GPUs crash with the following error:

Traceback (most recent call last):
  File "examples/seq2seq/run_summarization.py", line 591, in <module>
    main()
  File "examples/seq2seq/run_summarization.py", line 529, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/dima/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/dima/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1542, in training_step
    loss.backward()
  File "/usr/lib/python3/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/lib/python3/dist-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: unspecified launch failure

When I run nvidia-smi, I get:

Unable to determine the device handle for GPU 0000:40:00.0: Unknown Error

Rebooting the server helps to restore the GPUs, but the same problem happens again if I try to run the example script above.

Please help! :)

@LysandreJik
Member

Hi! This is a weird error; I haven't seen it before. I don't know how you set up your installation or which CUDA version torch is using, but I believe PyTorch still has incompatibilities with CUDA 11.2. If you installed torch as a wheel, the CUDA runtime is bundled and the system version shouldn't matter much, but since the only CUDA version I see in your report is 11.2, I can't help but wonder if that's the issue.

In any case, I doubt this is linked to transformers. Have you checked the following issues on the PyTorch GitHub? pytorch/pytorch#31702 and pytorch/pytorch#27837

Memory can sometimes be a factor, but given the size of a Quadro RTX 8000, I doubt that's the issue here...

@dmitriydligach
Author

@LysandreJik Thank you for getting back to me so quickly. I just checked which CUDA version torch is seeing:

>>> torch.__version__
'1.7.1'
>>> torch.version.cuda
'11.1'

I'm surprised that it's not CUDA 11.2, which is what nvidia-smi shows. Does this information help?

This doesn't seem like a GPU memory issue, because the example script runs fine for a few minutes and I can see (using nvidia-smi) that GPU memory is not fully used.
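
For reference, this check can be reproduced from the shell with a one-liner (a minimal sketch; nothing here is specific to transformers, it just prints what the installed torch wheel reports):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"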

@stas00
Contributor

stas00 commented Apr 4, 2021

but I believe it still has incompatibilities with CUDA 11.2.

Correct: pytorch/pytorch#50232 (comment)

RuntimeError: CUDA error: unspecified launch failure

Due to its async nature, often the only way to see the real error is to run PyTorch with the env var CUDA_LAUNCH_BLOCKING=1.

Perhaps if you do that you will get better information.

Also, please check the output of dmesg -T; sometimes the nvidia kernel module logs a kernel-level traceback in the system logs.
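
Concretely, that would look something like the following (a sketch reusing the arguments from the original report; the grep pattern is only a suggestion for spotting nvidia/Xid messages in the kernel log):

# Serialize CUDA kernel launches so the traceback points at the kernel
# that actually failed rather than at a later, unrelated call.
CUDA_LAUNCH_BLOCKING=1 python3 examples/seq2seq/run_summarization.py \
    --model_name_or_path t5-large \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir \
    --predict_with_generate

# After a crash, look for messages from the nvidia kernel module (Xid errors).
sudo dmesg -T | grep -iE "nvidia|xid" | tail -n 50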

@github-actions

github-actions bot commented May 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@dmitriydligach
Author

Responding to avoid making this issue stale:

This issue has not been resolved. I tried a number of different configurations, including different versions of PyTorch, but nothing helped.

@stas00
Contributor

stas00 commented May 4, 2021

Care to try with CUDA_LAUNCH_BLOCKING as suggested in #11020 (comment)?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 8, 2021