Function 'LogSoftmaxBackward' returned nan values in its 0th output #36

Closed
saniazahan opened this issue Apr 1, 2021 · 4 comments


saniazahan commented Apr 1, 2021

Hi, I am trying to train your model with the provided config for NTU-60 XSUB with --half and --amp-opt-level 1, but after step 6 it gives a "Function 'LogSoftmaxBackward' returned nan values in its 0th output" error. I had autograd anomaly detection turned on and CUDA_LAUNCH_BLOCKING=1 set. Could you please tell me why this might happen?

Also, you provided a pretrained model trained on un-normalized data. Is there a reason for that? What is your accuracy with normalized data? I tried both normalized and un-normalized data and got "nan" at step 6 either way. I suppose that if I turn off anomaly detection it will not complain for the time being, but I am concerned that the loss may become nan later. I checked my inputs; there are no "nan" values there. Your suggestions would be a great help.
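For reference, anomaly detection was turned on with the standard PyTorch switch (a minimal sketch, not code from this repo):

import torch

# Standard PyTorch anomaly detection: autograd reports which op produced the nan
torch.autograd.set_detect_anomaly(True)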


saniazahan commented Apr 1, 2021

If I don't use half precision, I have to reduce the batch size to 16 and the forward batch size to 8. In that case the "nan" occurs at the 94th step: "Function 'CudnnBatchNormBackward' returned nan values in its 0th output".

You suggested I might get poor performance or an unstable loss. I am not really sure why half precision would cause that.

kenziyuliu (Owner) commented May 2, 2021

Hi @saniazahan,

Thanks for your interest. Please find my responses to your questions below:

"Function 'LogSoftmaxBackward' returned nan values in its 0th output"

I've never seen this error before; could it be related to your package versions?
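That said, one generic mitigation for nans from a half-precision log-softmax (a sketch of a common workaround, not code from this repo; `logits` is an illustrative name for the model output) is to compute the log-softmax in fp32:

import torch.nn.functional as F

# Casting the fp16 logits to fp32 before log_softmax avoids -inf/nan from
# half-precision underflow, which would otherwise surface in LogSoftmaxBackward
log_probs = F.log_softmax(logits.float(), dim=-1)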

"... pretrained model trained on un-normalized data ..."

IIRC the data preprocessing steps should follow directly from 2s-AGCN: https://github.com/lshiwjx/2s-AGCN. Can you clarify what "normalized data" you are referring to? One particular thing to note is that, following previous work, there is also a BN layer at the beginning of the model that normalizes the input: https://github.com/kenziyuliu/MS-G3D/blob/master/model/msg3d.py#L156
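For context, the ST-GCN-family "data BN" layer typically looks like the sketch below (a minimal sketch of the common pattern, not a verbatim copy of msg3d.py; the class name and the defaults of 3 channels, 25 joints, and 2 persons are illustrative, and the (N, C, T, V, M) layout for batch, channels, frames, joints, persons follows 2s-AGCN):

import torch.nn as nn

class DataBN(nn.Module):
    """Normalizes skeleton input with one BatchNorm1d over person*joint*channel."""
    def __init__(self, in_channels=3, num_point=25, num_person=2):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_person * in_channels * num_point)

    def forward(self, x):
        # x: (N, C, T, V, M) -> flatten persons, joints and channels into one axis
        N, C, T, V, M = x.size()
        x = x.permute(0, 4, 3, 1, 2).contiguous().view(N, M * V * C, T)
        x = self.bn(x)
        # restore the original (N, C, T, V, M) layout
        x = x.view(N, M, V, C, T).permute(0, 3, 4, 2, 1).contiguous()
        return x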

"reduce the batch size to 16 and forward to 8. And the "nan" occurs at 94th step "Function 'CudnnBatchNormBackward' returned nan values in its 0th output"

Unfortunately I have not seen this error before. In general batch size is a hyperparameter that often affects performance, so to reproduce results from the paper you should use the default settings. Note also that small batch sizes don't go well with BatchNorm.

kenziyuliu (Owner) commented:
Hi there, I'll be closing this issue for now. Feel free to comment below if the issue was not resolved.


snknitin commented Nov 1, 2022

If it is an error in the 0th output, that means your weights are still not fully updated, so some values in your first predictions are nans. It is not your inputs but your model predictions that are nan, which could be an overflow or underflow error. Any loss function will then return tensor(nan). What you can do is put in a check for when the loss is nan and let the weights adjust themselves:

import torch

criterion = SomeLossFunc()
eps = 1e-6
loss = criterion(preds, targets)
# If the loss is nan, substitute a small constant tensor so backward() still
# runs and the nan does not propagate into the weight update
if torch.isnan(loss):
    loss = torch.tensor(eps, device=preds.device, requires_grad=True)
loss = loss + L1_loss  # plus any other auxiliary loss terms
