Loss tensor without grad_fn arises during training #12

Closed
dcasbol opened this issue Oct 25, 2020 · 6 comments

dcasbol commented Oct 25, 2020

After following the instructions and pip-installing the modified version of allennlp, I was able to run the training script, but it led to CUDA out-of-memory errors on my GPU. Bringing the batch size down from 4 to 3 fixes that, but then I get the following error around 7% of the way into the first epoch:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Is this somehow tied to the batch size, or is there a way to solve it?

Note: I'm running this with CUDA 10.1
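
For context, this is the error PyTorch raises whenever backward() is called on a tensor that is not connected to the autograd graph. A minimal, repo-independent reproduction (just to illustrate the failure mode, not the actual training code):

import torch

loss = torch.tensor(0.0)  # no requires_grad, so no grad_fn
loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn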


ramild commented Mar 17, 2021

@dcasbol Hi! I came across the same problem and was wondering whether you managed to find a solution. I'm struggling to figure out what the possible causes are, and I would greatly appreciate any hint.

dcasbol (Author) commented Mar 17, 2021

@Ramil0 I haven't found a solution to this so far. Fortunately, my research ended up not depending on it, but my guess is that some parts of the code are somehow hardcoded to work with that batch size.


ramild commented Mar 17, 2021

> @Ramil0 I haven't found a solution to this so far. Fortunately, my research ended up not depending on it, but my guess is that some parts of the code are somehow hardcoded to work with that batch size.

@dcasbol Thank you for your reply! That's strange: I didn't change the batch size, and I get this error even with the default numbers in the config.

dcasbol (Author) commented Mar 17, 2021

That's new information then: it doesn't depend on the batch size. It must instead have to do with the fact that some batches don't carry any supervision. Maybe it was implemented against an older or newer version of PyTorch, which just skips that case instead of raising the exception? If you're currently working on this, you could try to detect that scenario before the loss is returned and substitute it with something like:

dummy_loss = torch.tensor(0.0, dtype=torch.float, requires_grad=True)

And see if it behaves in a reasonable way.
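
A minimal sketch of what that guard could look like (the helper name and where it gets called are hypothetical, and it assumes unsupervised batches produce a loss tensor detached from the autograd graph):

import torch

def ensure_differentiable_loss(loss, device):
    # If the loss has no grad_fn (e.g. the batch carried no supervision),
    # replace it with a zero dummy loss so that loss.backward() does not
    # raise "element 0 of tensors does not require grad".
    if loss.grad_fn is None and not loss.requires_grad:
        return torch.tensor(0.0, dtype=torch.float, device=device, requires_grad=True)
    return loss

# e.g. just before the model's forward() returns its output dict:
# output_dict["loss"] = ensure_differentiable_loss(output_dict["loss"], device)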

nitishgupta (Owner) commented Mar 17, 2021

Sorry, I completely missed replying to this issue.

TBH, I don't know why this error occurs. I've spent quite a lot of time trying to fix it but couldn't; maybe I'm missing something trivial. I had a thought similar to what @dcasbol suggested, but couldn't get that to work. If you restart training, you shouldn't encounter the error, which suggests to me that it has something to do with CUDA randomization.


ramild commented Mar 17, 2021

@dcasbol Thank you, I'll try what you suggested.

@nitishgupta Thanks! I've already restarted the training, so there's hope that I won't see the same error again :)
