Function 'LogSoftmaxBackward' returned nan values in its 0th output #36

Closed
saniazahan opened this issue Apr 1, 2021 · 4 comments


saniazahan commented Apr 1, 2021

Hi, I am trying to train your model with the provided config for NTU-60 XSUB with --half and --amp-opt-level 1, but after step 6 it gives a "Function 'LogSoftmaxBackward' returned nan values in its 0th output" error. I had autograd anomaly detection turned on and CUDA_LAUNCH_BLOCKING=1 set. Could you please tell me why this might happen?

Also, you provided a pretrained model trained on un-normalized data. Is there a reason for that? What is your accuracy with normalized data? I tried both normalized and un-normalized data and got "nan" at step 6 either way. I suppose that if I turn off anomaly detection it will not complain for the time being, but I am concerned that the loss may become nan later. I checked my inputs; there are no "nan" values there. Your suggestions would be a great help.
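For reference, anomaly detection was turned on with the standard PyTorch switch (a minimal sketch, not code from this repo):

import torch

# Standard PyTorch anomaly detection: autograd reports which op produced the nan
torch.autograd.set_detect_anomaly(True)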


saniazahan commented Apr 1, 2021

If I don't use half precision, I have to reduce the batch size to 16 and the forward batch size to 8. In that case the "nan" occurs at the 94th step: "Function 'CudnnBatchNormBackward' returned nan values in its 0th output".

You suggested I might get poor performance or an unstable loss. I am not really sure why half precision would cause that.

kenziyuliu (Owner) commented May 2, 2021

Hi @saniazahan,

Thanks for your interest. Please find my responses to your questions below:

"Function 'LogSoftmaxBackward' returned nan values in its 0th output"

I've never seen this error before; could it be related to your package versions?
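That said, one generic mitigation for nans from a half-precision log-softmax (a sketch of a common workaround, not code from this repo; `logits` is an illustrative name for the model output) is to compute the log-softmax in fp32:

import torch.nn.functional as F

# Casting the fp16 logits to fp32 before log_softmax avoids -inf/nan from
# half-precision underflow, which would otherwise surface in LogSoftmaxBackward
log_probs = F.log_softmax(logits.float(), dim=-1)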

"... pretrained model trained on un-normalized data ..."

IIRC the data preprocessing steps should follow directly from 2s-AGCN: https://github.com/lshiwjx/2s-AGCN. Can you clarify what "normalized data" you are referring to? One particular thing to note is that, following previous work, there is also a BN layer at the beginning of the model that normalizes the input: https://github.com/kenziyuliu/MS-G3D/blob/master/model/msg3d.py#L156
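For context, the ST-GCN-family "data BN" layer typically looks like the sketch below (a minimal sketch of the common pattern, not a verbatim copy of msg3d.py; the class name and the defaults of 3 channels, 25 joints, and 2 persons are illustrative, and the (N, C, T, V, M) layout for batch, channels, frames, joints, persons follows 2s-AGCN):

import torch.nn as nn

class DataBN(nn.Module):
    """Normalizes skeleton input with one BatchNorm1d over person*joint*channel."""
    def __init__(self, in_channels=3, num_point=25, num_person=2):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_person * in_channels * num_point)

    def forward(self, x):
        # x: (N, C, T, V, M) -> flatten persons, joints and channels into one axis
        N, C, T, V, M = x.size()
        x = x.permute(0, 4, 3, 1, 2).contiguous().view(N, M * V * C, T)
        x = self.bn(x)
        # restore the original (N, C, T, V, M) layout
        x = x.view(N, M, V, C, T).permute(0, 3, 4, 2, 1).contiguous()
        return x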

"reduce the batch size to 16 and forward to 8. And the "nan" occurs at 94th step "Function 'CudnnBatchNormBackward' returned nan values in its 0th output"

Unfortunately I have not seen this error before. In general batch size is a hyperparameter that often affects performance, so to reproduce results from the paper you should use the default settings. Note also that small batch sizes don't go well with BatchNorm.

kenziyuliu (Owner) commented:
Hi there, I'll be closing this issue for now. Feel free to comment below if the issue was not resolved.


snknitin commented Nov 1, 2022

If it is an error in the 0th output, that means your weights are still not fully updated, so some values in your first predictions are nans. It is not your inputs but your model predictions that are nan, which could be an overflow or underflow error. Any loss function will then return tensor(nan). What you can do is put in a check for when the loss is nan and let the weights adjust themselves:

import torch

criterion = SomeLossFunc()
eps = 1e-6
loss = criterion(preds, targets)
# If the loss is nan, substitute a small constant tensor so backward() still
# runs and the nan does not propagate into the weight update
if torch.isnan(loss):
    loss = torch.tensor(eps, device=preds.device, requires_grad=True)
loss = loss + L1_loss  # plus any other auxiliary loss terms
