Loss tensor without grad_fn arises during training #12
@dcasbol Hi! I came across the same problem and was wondering if you managed to find a solution. I'm struggling to figure out what the possible reasons might be, and would greatly appreciate any hint.
@Ramil0 To this moment I have not found a solution. Fortunately, my research ended up not depending on it, but my guess is that some parts of the code are hardcoded to work with that batch size.
@dcasbol Thank you for your reply! This is strange: I didn't change the batch size, and I get this error even with the default numbers in the config.
That's new info then! It doesn't depend on batch size, so it must have to do with the fact that some batches don't have any supervision. Maybe they implemented it in an older/newer version of PyTorch that simply skips such batches instead of raising the exception? If you're currently working on that, you could try to detect that scenario before the loss is returned and substitute it with something like dummy_loss = torch.tensor(0.0, dtype=torch.float, requires_grad=True), and see whether it behaves in a reasonable way.
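The dummy-loss workaround above could be wrapped in a small guard applied just before the loss is returned. A minimal sketch, assuming the loss is a scalar tensor; `safe_loss` is a hypothetical helper name, not part of the repository's code:

```python
import torch

def safe_loss(loss: torch.Tensor) -> torch.Tensor:
    """Return the loss unchanged if it participates in the autograd graph;
    otherwise substitute a zero-valued dummy loss so .backward() won't raise.
    (Hypothetical helper sketching the workaround discussed above.)"""
    if loss.grad_fn is not None or loss.requires_grad:
        return loss
    return torch.tensor(0.0, dtype=torch.float, requires_grad=True)

# A batch with no supervision may produce a plain constant tensor:
detached = torch.tensor(0.0)      # no grad_fn, requires_grad=False
safe_loss(detached).backward()    # no RuntimeError: the dummy loss is a grad leaf
```

Since the dummy loss is a constant, its gradients are zero everywhere and the optimizer step is effectively a no-op for that batch.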
Sorry, I completely missed replying to this issue. TBH, I don't know why this error occurs. I've spent quite a lot of time trying to fix it but couldn't; maybe I'm missing something trivial. I had a thought similar to what @dcasbol suggested, but couldn't get that to work. If you restart training, you shouldn't encounter the error, which to me suggests it has something to do with CUDA randomization.
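If randomization is indeed the culprit, fixing the seeds can at least make restarted runs comparable while hunting for the failing batch. A minimal sketch; `seed_everything` is a hypothetical helper, and determinism flags may slow training:

```python
import random
import torch

def seed_everything(seed: int = 42) -> None:
    """Hypothetical helper: fix the common sources of randomness so a
    restarted run follows the same path. This does not guarantee the
    error disappears; it only makes runs reproducible."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and, if present, all GPU generators
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling `seed_everything` at the top of the training script before any data loading or model construction should make two runs draw the same random numbers.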
@dcasbol Thank you, I'll try what you suggested. @nitishgupta Thanks! I've already restarted the training, so there's hope that I won't see the same error again :)
After following the instructions and pip-installing the modified version of allennlp I was able to run the training script, but it led to CUDA out-of-memory errors on my GPU, so I brought the batch size down from 4 to 3. Training then works, but I get the following error around 7% of the way through the first epoch:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Is this somehow related to the batch size, or is there a way to solve it?
Note: I'm running this with CUDA 10.1
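For context, this RuntimeError is what PyTorch raises whenever `.backward()` is called on a tensor that is detached from the autograd graph. A minimal sketch reproducing it, unrelated to the repository's own code:

```python
import torch

# A tensor created outside the autograd graph has no grad_fn:
loss = torch.tensor(1.0)  # requires_grad=False, grad_fn is None
try:
    loss.backward()
except RuntimeError as e:
    # "element 0 of tensors does not require grad and does not have a grad_fn"
    print(e)

# The same value attached to the graph backpropagates fine:
w = torch.tensor(2.0, requires_grad=True)
(w * 0.5).backward()
print(w.grad)  # tensor(0.5000)
```

So the error in this issue means that, on that particular batch, the returned loss was built without any operation that requires gradients, which is consistent with the empty-supervision hypothesis above.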