NAN loss for regression while training #2134
You can't use softmax with only a single output unit.
Sorry, it's commented out... there is no softmax layer at the output. I've updated the question.
Have you tried reducing the batch size? I sometimes get loss: nan with my LSTM networks for time-series regression, and I can nearly always avoid it either by reducing the sizes of my layers or by reducing the batch size.
Have you tried using, e.g., rmsprop instead of sgd? It usually worked better for me with regression.
A few comments. Your l2 regularizers are using pretty large terms; try something much smaller to start (see the sketch below).
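A rough illustration of "much smaller" (this uses the modern `kernel_regularizer` argument; the code in the original question uses the older `W_regularizer` name, and the exact value here is an assumption, not from the comment):

```python
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

# l2(1e-4) is a common starting point; l2(0.5) penalizes weights so hard
# that the optimizer can be pushed into unstable territory
dense = Dense(800, activation='relu',
              kernel_regularizer=regularizers.l2(1e-4))
```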
I wanted to point this out so that it's archived for others who may experience this problem in future. My loss function was suddenly returning a NaN after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, and the size and shape of the network. I was still getting loss that eventually turned into a NaN, and I was getting quite frustrated. Then it dawned on me: I may have some bad input. It turns out one of the images I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the standard deviation, and thus I ended up with an exemplar matrix that was nothing but NaNs. Once I fixed my normalization function, my network now trains perfectly.
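A minimal sketch of the kind of guard described above (the function name and the epsilon value are illustrative, not from the original comment):

```python
import numpy as np

def normalize_image(img, eps=1e-8):
    """Mean/std normalization that tolerates constant (e.g. all-zero) images."""
    img = img.astype(np.float64)
    std = img.std()
    if std < eps:
        # a constant image has std ~ 0; dividing by it would yield NaNs
        return img - img.mean()
    return (img - img.mean()) / std
```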
Sharing my experience for the benefit of others... One thing I found is that the optimizer plays a role in the NaN loss issue. Changing from rmsprop to the adam optimizer made this problem go away for me when training an LSTM.
I recall I had such problems when I used the SGD optimizer too, and also rmsprop. Try adam.
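A hedged sketch of the optimizer swap being suggested (the model itself is a placeholder):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([Dense(1, input_shape=(10,))])  # placeholder regression model
# 'adam' in place of 'sgd' or 'rmsprop' -- often the quickest experiment to try
model.compile(loss='mean_squared_error', optimizer='adam')
```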
For future reference: a NaN loss can come from any value in your dataset that is not a float or int. In my case, there were some NumPy infinities (np.inf) resulting from a divide by zero in the program that prepares my dataset. Checking for inf or NaN data first may save you time otherwise spent hunting for faults in the model.
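A quick sanity check along these lines (the arrays here are synthetic placeholders standing in for whatever goes into model.fit):

```python
import numpy as np

X = np.random.rand(100, 8)   # placeholder features
y = np.random.rand(100)      # placeholder targets
X[3, 2] = np.inf             # simulate a bad value from a divide-by-zero

assert np.all(np.isfinite(X)), "features contain NaN or inf"  # this will fire
assert np.all(np.isfinite(y)), "targets contain NaN or inf"
```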
@ctawong I'm using …
Optimizer selection was a major factor in my problem as well (image convolution with unbounded output: gradient explosion with SGD). My experience was that RMSprop with heavy regularization was effective at preventing gradient explosion, but it made training converge very slowly (many steps/epochs required). Adam worked with no dropout/regularization and consequently converged very quickly. Whether no dropout/regularization is a good idea (since they help prevent over-fitting) is a separate question, but at least now I can determine the proper amount.
My recommendations regarding the issue:
df.isnull().any()
I spent literally hours on this problem, going through every possible suggestion. Then I discovered that one column in my data set had the same numerical value in every row, making it effectively a worthless addition to the DNN. I'd recommend everyone go right back to their data and not make any assumptions or take anything for granted.
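One quick way to catch such a column, sketched with pandas (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("train.csv")                  # placeholder dataset
constant_cols = df.columns[df.nunique() <= 1]  # columns with a single repeated value
print(constant_cols)                           # these carry no signal; consider dropping them
```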
I normalized my input data to [0, 1], which solved the loss = nan error.
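A minimal sketch of that kind of [0, 1] scaling (the epsilon guard against constant columns is my addition, not part of the original comment):

```python
import numpy as np

def minmax_scale(x, eps=1e-8):
    """Scale each feature column to [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + eps)  # eps avoids 0/0 on constant columns
```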
Changing from sgd to rmsprop solved the problem for me (linear regression problem).
I also had these problems. I tried everything mentioned above and nothing helped. But now I seem to have found a solution. I am using a fit generator, and the Keras documentation (fit_generator) mentions that: …
I changed my generator to only output batches of exactly the right size anyway. And voilà, since then I don't get NaN and inf anymore. Not sure if this helps everybody, but I still want to post what helped me.
I tried every suggestion on this page and many others to no avail. We were importing CSV files with pandas, then using … If you're performing textual analysis and getting …
I had the loss = nan issue and solved it by making sure the number of classes in the config and in my dataset are the same. The default num classes was 92+1.
Hi guys, I've run into a weird problem. Training: 10,000 images; example 1.
When I run that code, the train loss and val loss are NaN.
But when I run this other code, the train loss and val loss are normal.
I was getting the loss as NaN in the very first epoch, as soon as training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()). I hope this helps someone encountering a similar problem.
Hi, I put model.add(BatchNormalization()) after the conv layers and it works for me.
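Sketched out (the layer sizes here are illustrative, not from the comment):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation

model = Sequential([
    Conv2D(32, (3, 3), input_shape=(32, 32, 1)),
    BatchNormalization(),  # normalizes conv outputs, which can tame runaway activations
    Activation('relu'),
])
```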
In my case, it was the loss function. I had used loss='sparse_categorical_crossentropy' and switched to loss=losses.mean_squared_error (from keras import losses). The loss returned to normal.
I solved my "loss: nan" problem by fixing my annotations.
The problem can happen for several reasons; mine was because of the second item, …
I experienced the same issue and wanted to share that in my case it wasn't one of the features that had the NaN/inf value; it was actually an infinite Y value. Hope that helps someone...
I was facing the same problem. I thought I had my inputs covered and carried out a few suggestions mentioned here, but to no avail. When I inspected my inputs again, I had one value as NaN. When I took care of it, it worked.
Thank you so much! The NaN val loss disappeared. But my validation set is an integer multiple of the batch size. Why is this helpful?
Thanks to @eng-tsmith
All discussions talk …
How did you do that inside the Keras loss function?
I had the problem that my regularization loss became …
I had similar issues with my loss function; moreover, the eigenvalue decomposition inside TensorFlow v1.13.1 has an issue where, if any singular value is encountered, the loss function still remains NaN even if you filter NaN from the values and perform any aggregate operations.
In my case, model.predict returns NaN values. The model is already trained and I have saved weights. After compiling the model and loading the weights, the prediction returns NaN.
I was sometimes taking the log of a very small number somewhere in my cost function. I added a tiny amount of jitter to stop the output becoming -inf and a NaN being produced at the next update. Hope this might help someone one day!
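For illustration, a custom loss with that kind of jitter (the loss itself is a made-up example, not the commenter's actual cost function):

```python
from tensorflow.keras import backend as K

def log_loss_with_jitter(y_true, y_pred):
    # K.epsilon() (~1e-7) keeps log() away from -inf when y_pred approaches 0
    return -K.mean(y_true * K.log(y_pred + K.epsilon()))
```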
I had the same error for a multiclass problem I was dealing with. At first my output layer had only 1 node and it was giving me a loss of NaN, so I changed the output layer to the number of classes I had and it worked!
I tried all the solutions mentioned above until I finally figured out that I had put in an intermediate backend layer to log-transform a list that possibly contains 0 values. Hope it helps.
I tried a lot of alternatives, and it looks like the NaN loss can be caused by many different things. In my case, the custom Keras image generator (a Keras Sequence) I wrote was the culprit. While generating batches of images, I had initialized the numpy array as np.empty((self.batch_size, self.resize_shape_tuple[0], self.resize_shape_tuple[1], self.num_channels)). I changed that to np.zeros() and it resolved the issue.
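The gist of the fix, sketched (the shape values are placeholders):

```python
import numpy as np

batch_size, height, width, channels = 32, 64, 64, 3
# np.empty returns uninitialized memory: any slot the generator forgets to fill
# holds garbage that can silently poison the loss
batch = np.zeros((batch_size, height, width, channels), dtype=np.float32)
```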
Retraining helped me to solve this problem.
Thanks @lhatsk, using RMSprop instead of SGD solved my problem.
This is nice, but dropping the whole row because of a NaN value is bad for time-series problems. I found that using df.fillna(0) gets better results!
My problem was that my y (target) values were in [1, 2], not [0, 1]. This made my loss value negative and eventually produced a NaN loss.
For me the core of the problem was that I used "relu" as the activation function in the LSTM layer.
I had this problem, and my model would predict NaNs on any data, even though my losses were decreasing normally.
In my case, the problem was that the number of output neurons in the last layer didn't match the actual label size. This bug caused me quite a headache. :)
Appreciated! Looks like I had the same problem. Thanks!
For everyone out there who might face the NaN issue when using the ELBO loss for VAEs, here is what seems to have fixed it for me. Short version: lower the learning rate. Long version:
All of the above did not fix the issue. I finally tried dropping the learning rate from 1e-3 down to 1e-4, which made all the difference. No more NaN so far. I can now train with very small to no L2 kernel regularization. The VAE now also trains with differently normalized training data (sigmoid, tanh, z-score). Side note: …
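In code, the learning-rate drop looks roughly like this (the model and loss names in the comment are placeholders):

```python
from tensorflow.keras.optimizers import Adam

opt = Adam(learning_rate=1e-4)  # down from the 1e-3 default
# vae.compile(optimizer=opt, loss=elbo_loss)  # vae / elbo_loss are placeholders
```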
If your file has any missing values, you can get loss = nan.
Now I know almost no one will have made it this far into the post, but for those of you who still have this problem, here's a tip: DON'T use negative integers in your …
In my case, the optimizer was the trick. Using adagrad instead of adam fixed the problem. The problem only happened when I trained the model distributedly with adam; in a single-node setup, adam works fine.
I am using skeleton (x, y) coordinates from the MPII dataset. Changing relu to tanh worked for me.
Finally, tweaking the epsilon parameter from 1e-8 (the default value) to 1e-6 fixed the problem. This may make gradient descent slower, but the stability of training is more important.
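Concretely, a hedged sketch (the comment does not name the optimizer; epsilon is a parameter of the Adam family, so Adam is assumed here):

```python
from tensorflow.keras.optimizers import Adam

# raising epsilon above the 1e-8 default mentioned in the comment makes the
# update denominator (sqrt of the second-moment estimate + epsilon) less
# likely to blow the step size up
opt = Adam(epsilon=1e-6)
```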
I'm running a regression model on 32x32 patches extracted from images, against a real value as the target. I have 200,000 samples for training, but I'm encountering a NaN loss during the very first epoch. Can anyone help me solve this problem, please? I've tried on both GPU and CPU, but the issue still appears.
```python
# Keras 1.x-era code as posted; imports added for completeness
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.regularizers import l2
from keras.optimizers import SGD

model = Sequential()
# single-channel 32x32 patches, channels-first ordering
model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
# very heavy L2 (0.5) and dropout (0.7) -- several replies above point here
model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(Dense(1))  # single linear output for regression
sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=256, nb_epoch=40)  # X_train / Y_train: the 200k patches
```