Trainer API crashes GPUs #11020

Closed
dmitriydligach opened this issue Apr 1, 2021 · 7 comments

@dmitriydligach

Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Ubuntu 20.04.2 LTS
  • Python version: Python 3.8.5
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?): 2.4.1
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

My scripts that use the Trainer API crash the GPUs on a Linux server with 4 Quadro RTX 8000 GPUs (NVIDIA-SMI 460.39, Driver Version: 460.39, CUDA Version: 11.2). To determine whether the problem is on my end, I installed the Hugging Face examples as described in https://huggingface.co/transformers/examples.html.

I then run:

python3 examples/seq2seq/run_summarization.py \
    --model_name_or_path t5-large \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir \
    --predict_with_generate

After this script runs for a few minutes (and I can see that the GPUs are being utilized when I run nvidia-smi), all GPUs crash with the following error:

Traceback (most recent call last):
  File "examples/seq2seq/run_summarization.py", line 591, in <module>
    main()
  File "examples/seq2seq/run_summarization.py", line 529, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/dima/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/dima/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1542, in training_step
    loss.backward()
  File "/usr/lib/python3/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/lib/python3/dist-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: unspecified launch failure

When I run nvidia-smi, I get:

Unable to determine the device handle for GPU 0000:40:00.0: Unknown Error

Rebooting the server helps to restore the GPUs, but the same problem happens again if I try to run the example script above.

Please help! :)

@LysandreJik
Member

Hi! This is a weird error; I haven't seen it before. I don't know how you set up your installation or which CUDA version torch is using, but I believe PyTorch still has incompatibilities with CUDA 11.2. If you installed torch as a wheel, the CUDA runtime is bundled and the system version shouldn't matter much, but since the only CUDA version I see in your report is 11.2, I can't help but wonder if that's the issue.

In any case, I doubt this is linked to transformers. Have you checked the following issues on the PyTorch GitHub? pytorch/pytorch#31702 and pytorch/pytorch#27837

Memory can sometimes be a factor, but given the size of a Quadro RTX 8000, I doubt that's the issue here...

@dmitriydligach
Author

@LysandreJik Thank you for getting back to me so quickly. I just checked which CUDA version torch is seeing:

>>> torch.__version__
'1.7.1'
>>> torch.version.cuda
'11.1'

I'm surprised that it's not CUDA 11.2, which is what nvidia-smi shows. Does this information help?

This doesn't seem like a GPU memory issue, because the example script runs fine for a few minutes and I can see (using nvidia-smi) that GPU memory is not fully used.
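
For reference, this check can be reproduced from the shell with a one-liner (a minimal sketch; nothing here is specific to transformers, it just prints what the installed torch wheel reports):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"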

@stas00
Contributor

stas00 commented Apr 4, 2021

but I believe it still has incompatibilities with CUDA 11.2.

Correct: pytorch/pytorch#50232 (comment)

RuntimeError: CUDA error: unspecified launch failure

Due to its async nature, often the only way to see the real error is to run PyTorch with the env var CUDA_LAUNCH_BLOCKING=1.

Perhaps if you do that you will get better information.

Also, please check the output of dmesg -T; sometimes the nvidia kernel module logs a kernel-level traceback in the system logs.
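
Concretely, that would look something like the following (a sketch reusing the arguments from the original report; the grep pattern is only a suggestion for spotting nvidia/Xid messages in the kernel log):

# Serialize CUDA kernel launches so the traceback points at the kernel
# that actually failed rather than at a later, unrelated call.
CUDA_LAUNCH_BLOCKING=1 python3 examples/seq2seq/run_summarization.py \
    --model_name_or_path t5-large \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir \
    --predict_with_generate

# After a crash, look for messages from the nvidia kernel module (Xid errors).
sudo dmesg -T | grep -iE "nvidia|xid" | tail -n 50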

@github-actions

github-actions bot commented May 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@dmitriydligach
Author

Responding to avoid making this issue stale:

This issue has not been resolved. I tried a number of different configurations, including different versions of PyTorch, but nothing helped.

@stas00
Contributor

stas00 commented May 4, 2021

Care to try with CUDA_LAUNCH_BLOCKING as suggested in #11020 (comment)?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 8, 2021