Trainer API crashes GPUs #11020
Hi! This is a weird error, I haven't seen it before. I don't know how you've set up your installation or which CUDA version torch is using, but I believe torch still has incompatibilities with CUDA 11.2. If you installed it as a wheel, I think CUDA is bundled and the system version doesn't really matter, but since the only thing I see in your report is CUDA version 11.2, I can't help but wonder if that's the issue. As for memory: it can be an issue, but given the size of a Quadro RTX 8000 I doubt that's the problem here.
@LysandreJik Thank you for getting back to me so quickly. I just checked which CUDA version torch is seeing:
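For reference, a minimal way to check which CUDA build torch reports, assuming a standard pip install of torch:

```python
import torch

print(torch.__version__)          # installed torch build
print(torch.version.cuda)         # CUDA version this torch build was compiled against
print(torch.cuda.is_available())  # whether torch can actually see the GPUs
```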
I'm surprised that it's not CUDA 11.2, which is what nvidia-smi shows. Does this information help? This doesn't seem like a GPU memory issue, because the example script runs fine for a few minutes and I can see (using nvidia-smi) that GPU memory is not being fully used.
Correct: pytorch/pytorch#50232 (comment)
Due to CUDA's async nature, often the only way to see the real error is to run pytorch with the env var CUDA_LAUNCH_BLOCKING=1. Perhaps if you do that you will get better information. Also please check the output of
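A minimal sketch of setting that variable from Python, assuming it is set before torch initializes CUDA (exporting it in the shell before launching the script works just as well):

```python
import os

# Force synchronous CUDA kernel launches so the error surfaces at the real
# failing call instead of at a later, unrelated point in the program.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```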
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Responding to avoid making this issue stale: This issue has not been resolved. I tried a number of different configurations including different versions of pytorch, but it didn't help.
Care to try with CUDA_LAUNCH_BLOCKING as suggested in #11020 (comment)?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.5.0.dev0

My scripts that use the Trainer API crash the GPUs on a Linux server that has 4 Quadro RTX 8000 GPUs (NVIDIA-SMI 460.39, Driver Version: 460.39, CUDA Version: 11.2). In order to understand whether this is my problem or not, I installed the Hugging Face examples as described in
https://huggingface.co/transformers/examples.html.
I then run
python3 examples/seq2seq/run_summarization.py \
After this script runs for a few minutes (and I can see that the GPUs are being utilized when I run nvidia-smi), all GPUs crash with the following error:
Traceback (most recent call last):
File "examples/seq2seq/run_summarization.py", line 591, in
main()
File "examples/seq2seq/run_summarization.py", line 529, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/dima/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1120, in train
tr_loss += self.training_step(model, inputs)
File "/home/dima/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1542, in training_step
loss.backward()
File "/usr/lib/python3/dist-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/lib/python3/dist-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: unspecified launch failure
When I run nvidia-smi, I get:
Unable to determine the device handle for GPU 0000:40:00.0: Unknown Error
Rebooting the server helps to restore the GPUs, but the same problem happens again if I try to run the example script above.
Please help! :)