Loss tensor without grad_fn arises during training #12

Closed
dcasbol opened this issue Oct 25, 2020 · 6 comments

dcasbol commented Oct 25, 2020

After following the instructions and pip-installing the modified version of allennlp, I was able to run the training script, but it led to CUDA out-of-memory errors on my GPU. Bringing the batch size down from 4 to 3 fixes that, but then I get the following error around 7% of the way into the first epoch:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Is this somehow tied to the batch size, or is there a way to solve it?

Note: I'm running this with CUDA 10.1
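
For context, this is the error PyTorch raises whenever backward() is called on a tensor that is not connected to the autograd graph. A minimal, repo-independent reproduction (just to illustrate the failure mode, not the actual training code):

import torch

loss = torch.tensor(0.0)  # no requires_grad, so no grad_fn
loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn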


ramild commented Mar 17, 2021

@dcasbol Hi! I came across the same problem and was wondering whether you managed to find a solution. I'm struggling to figure out what the possible causes are, and I would greatly appreciate any hint.

dcasbol (Author) commented Mar 17, 2021

@Ramil0 I haven't found a solution to this so far. Fortunately, my research ended up not depending on it, but my guess is that some parts of the code are somehow hardcoded to work with that batch size.


ramild commented Mar 17, 2021

> @Ramil0 I haven't found a solution to this so far. Fortunately, my research ended up not depending on it, but my guess is that some parts of the code are somehow hardcoded to work with that batch size.

@dcasbol Thank you for your reply! That's strange: I didn't change the batch size, and I get this error even with the default numbers in the config.

dcasbol (Author) commented Mar 17, 2021

That's new information then: it doesn't depend on the batch size. It must instead have to do with the fact that some batches don't carry any supervision. Maybe it was implemented against an older or newer version of PyTorch, which just skips that case instead of raising the exception? If you're currently working on this, you could try to detect that scenario before the loss is returned and substitute it with something like:

dummy_loss = torch.tensor(0.0, dtype=torch.float, requires_grad=True)

And see if it behaves in a reasonable way.
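
A minimal sketch of what that guard could look like (the helper name and where it gets called are hypothetical, and it assumes unsupervised batches produce a loss tensor detached from the autograd graph):

import torch

def ensure_differentiable_loss(loss, device):
    # If the loss has no grad_fn (e.g. the batch carried no supervision),
    # replace it with a zero dummy loss so that loss.backward() does not
    # raise "element 0 of tensors does not require grad".
    if loss.grad_fn is None and not loss.requires_grad:
        return torch.tensor(0.0, dtype=torch.float, device=device, requires_grad=True)
    return loss

# e.g. just before the model's forward() returns its output dict:
# output_dict["loss"] = ensure_differentiable_loss(output_dict["loss"], device)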

nitishgupta (Owner) commented Mar 17, 2021

Sorry, I completely missed replying to this issue.

TBH, I don't know why this error occurs. I've spent quite a lot of time trying to fix it but couldn't; maybe I'm missing something trivial. I had a thought similar to what @dcasbol suggested, but couldn't get that to work. If you restart training, you shouldn't encounter the error, which suggests to me that it has something to do with CUDA randomization.


ramild commented Mar 17, 2021

@dcasbol Thank you, I'll try what you suggested.

@nitishgupta Thanks! I've already restarted the training, so there's hope that I won't see the same error again :)
