Function 'LogSoftmaxBackward' returned nan values in its 0th output #36
Hi, I am trying to train your model with the provided config for NTU-60 XSUB with --half and --amp-opt-level 1, but after step 6 it gives a "Function 'LogSoftmaxBackward' returned nan values in its 0th output" error. I had autograd anomaly detection turned on and CUDA_LAUNCH_BLOCKING=1. Could you please tell me why this might happen? Also, you provided a pretrained model trained on un-normalized data; is there a reason for that, and what is your accuracy with normalized data? I tried both normalized and un-normalized data but got "nan" at step 6 either way. I guess if I turn off anomaly detection it will not complain for the time being, but I am concerned that the loss may become nan later. I checked my inputs; there are no "nan"s there. Your suggestions will be a great help.
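For context, anomaly detection reports which op produced the nan, and if --amp-opt-level follows the usual NVIDIA Apex convention, level 1 corresponds to Apex's O1 mixed-precision mode. A minimal sketch of that setup, assuming the usual model, optimizer, criterion, data, and labels objects (hypothetical names, not taken from this repo):

```python
import torch
from apex import amp  # NVIDIA Apex mixed-precision utilities

torch.autograd.set_detect_anomaly(True)  # report the op that produced the nan

# O1: patches ops to run in fp16 where safe and enables dynamic loss scaling
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

loss = criterion(model(data), labels)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # backprop through the scaled loss
optimizer.step()
```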
If I don't use half precision, I have to reduce the batch size to 16 and the forward batch to 8, and the "nan" then occurs at step 94: "Function 'CudnnBatchNormBackward' returned nan values in its 0th output". You suggested I might get poor performance or unstable loss; I am not really sure why half precision would do that.
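A likely reason: fp16 has a much narrower dynamic range than fp32, so activations or gradients can overflow to inf or underflow to zero, and the backward pass then surfaces nans. Dynamic loss scaling is the standard mitigation; a minimal sketch using native torch.cuda.amp (an alternative to the Apex path, assuming a standard training loop with hypothetical model, criterion, optimizer, and loader names):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss to keep fp16 gradients in range

for data, labels in loader:
    optimizer.zero_grad()
    with autocast():               # run the forward pass in mixed precision
        loss = criterion(model(data), labels)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # adapts the scale factor for the next step
```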
Hi @saniazahan, thanks for your interest. Please find my responses to the questions below:

1. I've never seen this error before; could it be related to your package versions?
2. IIRC the data preprocessing steps follow directly from 2s-AGCN: https://github.com/lshiwjx/2s-AGCN. Can you clarify what "normalized data" you are referring to? One thing to note is that, following previous work, there is also a BN layer at the beginning of the model that normalizes the input: https://github.com/kenziyuliu/MS-G3D/blob/master/model/msg3d.py#L156 (see the sketch after this list).
3. Unfortunately I have not seen this error before. In general, batch size is a hyperparameter that often affects performance, so to reproduce the results from the paper you should use the default settings. Note also that small batch sizes don't go well with BatchNorm.
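That input BN layer follows the 2s-AGCN convention: the person, joint, and channel axes are flattened into one dimension and normalized over batch and time. A rough sketch of the pattern (shapes here are illustrative, not the repo's actual config values):

```python
import torch
import torch.nn as nn

# Illustrative skeleton-data shape: N=batch, C=channels, T=frames,
# V=joints, M=persons
N, C, T, V, M = 16, 3, 300, 25, 2
x = torch.randn(N, C, T, V, M)

data_bn = nn.BatchNorm1d(M * V * C)  # one statistic per (person, joint, channel)

# Flatten (M, V, C) into the channel axis, normalize over batch and time,
# then restore the original (N, C, T, V, M) layout.
x = x.permute(0, 4, 3, 1, 2).contiguous().view(N, M * V * C, T)
x = data_bn(x)
x = x.view(N, M, V, C, T).permute(0, 3, 4, 2, 1).contiguous()
```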
Hi there, I'll be closing this issue for now. Feel free to comment below if the issue was not resolved.
If it is an error in the 0th output, that means some values in your very first predictions are nans while your weights are not yet fully updated. So it's not your inputs, but your model predictions that are nans; it could be an overflow or underflow error. This will make any loss function give you a nan loss. You can guard against it with something like:

```python
import torch

criterion = SomeLossFunc()  # placeholder for your actual loss function
eps = 1e-6

loss = criterion(preds, targets)
if loss.isnan():
    # Swap the nan for a tiny constant so the step doesn't poison training
    loss = torch.tensor(eps, device=preds.device, requires_grad=True)
loss = loss + L1_loss  # + ... any other loss terms
```
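One caveat with this guard: a constant loss does not depend on the model parameters, so the backward pass for that batch yields zero gradients and the update is effectively skipped rather than corrected. Loss scaling (as in the Apex/torch.cuda.amp sketches above) addresses the underlying fp16 range problem instead.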