NAN loss for regression while training #2134

Closed
indra215 opened this issue Mar 30, 2016 · 57 comments

@indra215

I'm running a regression model on 32x32 patches extracted from images, against a real value as the target. I have 200,000 training samples, but I'm encountering a nan loss during the very first epoch. Can anyone please help me solve this problem? I've tried on both GPU and CPU, but the issue appears either way.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, Activation
from keras.regularizers import l2
from keras.optimizers import SGD

model = Sequential()

model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))

model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))

# single linear output for the regression target
model.add(Dense(1))

sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)

model.fit(X_train, Y_train, batch_size=256, nb_epoch=40)

@NasenSpray

You can't use softmax with only a single output unit.

@indra215
Author

Sorry, it's commented out... there is no softmax layer at the output. I've updated the question.

@jpeg729
Contributor

jpeg729 commented Mar 30, 2016

Have you tried reducing the batch size?

I sometimes get loss: nan with my LSTM networks for time-series regression, and I can nearly always avoid it either by reducing the sizes of my layers, or by reducing the batch size.

@lhatsk

lhatsk commented Mar 31, 2016

Have you tried using e.g., rmsprop instead of sgd? Usually worked better for me with regression.

@the-moliver
Contributor

A few comments. Your l2 regularizers use pretty large terms; try something much smaller to start, e.g. l2(0.001), or get rid of them altogether to see if that helps. You may be driving your weights to 0 too fast. Your dropout rates are also pretty high; people generally don't go above 0.5. Also, for regression problems I find larger batch sizes (around 500) more useful.
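
For anyone who wants a concrete starting point, here is a minimal sketch of those suggestions applied to the model above (same Keras 1.x style API as the original post; the exact values are illustrative, not prescribed by the commenter):

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, Activation
from keras.regularizers import l2
from keras.optimizers import SGD

model = Sequential()
model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(800, W_regularizer=l2(0.001)))  # much weaker l2 than 0.5
model.add(Activation('relu'))
model.add(Dropout(0.5))                         # dropout capped at 0.5
model.add(Dense(1))

sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=512, nb_epoch=40)  # larger batch size (~500)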

@cjnolet

cjnolet commented Dec 21, 2016

I wanted to point this out so that it's archived for others who may experience this problem in the future. My loss function was suddenly returning a nan after it got some way into training. I checked the relus, the optimizer, the loss function, my dropout in relation to the relus, and the size and shape of my network. I was still getting a loss that eventually turned into a nan, and I was getting quite frustrated.

Then it dawned on me: I might have some bad input. It turned out that one of the images I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the standard deviation, so I ended up with an exemplar matrix that was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.
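
A hypothetical normalization helper along these lines, which guards against all-zero (zero-variance) patches so the division cannot produce NaNs (the function name and epsilon are mine, purely to illustrate the fix):

import numpy as np

def normalize_patch(patch, eps=1e-8):
    patch = patch.astype(np.float64)
    std = patch.std()
    if std < eps:  # constant (e.g. all-zero) patch: skip the division entirely
        return patch - patch.mean()
    return (patch - patch.mean()) / std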

@andrewssdd

Sharing my experience for the benefit of others... One thing I found is that the optimizer plays a role in the nan loss issue. Changing from rmsprop to the adam optimizer made this problem go away for me when training an LSTM.

@ylmeng

ylmeng commented May 19, 2017

I recall I had such problems when I used SGD optimizer too, and also rmsprop. Try adam.

@mxh000

mxh000 commented Jun 12, 2017

For future reference: a NaN loss can be caused by any value in your dataset that is not a finite number. In my case, there were some NumPy infinities (np.inf), resulting from a divide by zero in the program that prepares the dataset. Checking for inf or nan in the data first may save you time spent hunting for faults in the model.
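
A quick sanity check of this kind, assuming the features and targets are NumPy arrays named X and y (np.isfinite flags both NaN and inf):

import numpy as np

assert np.all(np.isfinite(X)), "X contains NaN or inf"
assert np.all(np.isfinite(y)), "y contains NaN or inf"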

@naisanza

naisanza commented Aug 15, 2017

@ctawong I'm using relu for activation, categorical_crossentropy for loss, and adam for optimization and I'm getting nan for the loss value

@brittohalloran

brittohalloran commented Oct 7, 2017

Optimizer selection was a major factor in my problem as well (image convolution with unbounded output - gradient explosion with SGD). My experience was that RMSprop with heavy regularization was effective in preventing gradient explosion, but that caused training to converge very slowly (many steps / epochs required).

Adam worked with no dropout / regularization and consequently converged very quickly. Whether skipping dropout / regularization is a good idea (since they help prevent over-fitting) is a separate question, but at least now I can determine the proper amount.

@unnir

unnir commented Jan 4, 2018

My recommendations regarding the issue:

  • Try different optimizers, e.g. sgd, nadam, adam...
  • Scale your data differently, e.g. try the ranges [0, 1] or [-1, 1].
  • In my case the learning rate parameter was the critical one.
  • And the most important thing: always check for NaNs or inf in your dataset. You can do it like this:

df.isnull().any()
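
For completeness, a pandas-style check that covers inf as well as NaN (df is assumed to be the DataFrame holding your data):

import numpy as np

print(df.isnull().any())                            # per-column NaN check
print(np.isinf(df.select_dtypes('number')).any())   # per-column inf check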

@gavriel-merkado

I spent literally hours on this problem, going through every possible suggestion. Then I discovered that one column in my data set had the same numerical value in every row, making it effectively a worthless addition to the DNN. I'd recommend that anyone go right back to their data and not make any assumptions or take anything for granted.

@Mboga

Mboga commented Feb 8, 2018

I have normalized my input data to [0,1] which solved the error where loss = nan

@emoen

emoen commented Feb 28, 2018

Changing from sgd to rmsprop solved the problem for me (linear regression problem)

@eng-tsmith

I also had these problems. I tried everything mentioned above and nothing helped.

But now I seem to have found a solution. I am using a fit generator, and the Keras documentation (fit_generator) mentions that:

...different batches may have different sizes. The last batch of the
epoch is commonly smaller than the others...

I still changed my generator to only output batches of the right size, and voilà, since then I don't get NaN or inf anymore.

Not sure if this helps everybody, but I still wanted to post what helped me.

@claycoleman

I tried every suggestion on this page and many others to no avail. We were importing CSV files with pandas, then using the Keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing that some CSV files led to nan while others worked, we suddenly looked at the encoding of the files and realized that ASCII files were NOT working with Keras, leading to nan loss and an accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.

If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. I haven't tried the latter, but I'd imagine it would work as well. Hopefully this helps someone very, very frustrated!
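
If the file does turn out to be ISO-8859-1, a re-encoding step like the following is one option (file names are placeholders, and it assumes the source really is Latin-1):

with open('input.csv', 'r', encoding='iso-8859-1') as src:
    text = src.read()
with open('input_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)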

@AloshkaD

I had the loss = nan issue and solved it by making sure the number of classes in the config matched my dataset. The default num_classes was 92+1.

@alyato

alyato commented Apr 19, 2018

Hi guys, I've run into a weird problem.

Training: 10,000 images
Validation: 2,000 images
nb_classes: 8

example 1.

base_model = densenet121(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(8, activation='sigmoid')(x)
model = Model(input=base_model.input, output=predictions)
for layer in base_model.layers:
    layer.trainable = False
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, validation_data=(X_val, Y_val))

When I run this code, the train loss and val loss are NaN.
Then I changed the network.
example 2.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),padding='same',input_shape=(channels,img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('sigmoid'))
model.compile(loss=multitask_loss,optimizer='adam',metrics=['accuracy'])
model.fit(X_train, Y_train,batch_size=batch_size,epochs=epochs,verbose=1 ,validation_data=(X_val, Y_val))

But when I run this code, the train loss and val loss are normal.
So I think the problem is in the network while fine-tuning it.
I want to use the pre-trained DenseNet; how can I solve the NaN loss?
Thanks.

@Krithi07

I was getting the loss as nan in the very first epoch, as soon as training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).

I hope this helps someone encountering a similar problem.

@ghost

ghost commented Jul 30, 2018

Hi,

I put model.add(BatchNormalization()) after the conv layers and it works for me.
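
A minimal sketch of that arrangement (Keras 2 style; layer sizes and input shape are only illustrative, and whether to place BatchNormalization before or after the activation is a separate debate):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 1)))
model.add(BatchNormalization())   # normalize the conv outputs
model.add(Activation('relu'))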

@yosunpeng

In my case, it was the loss function. I used loss='sparse_categorical_crossentropy' and switched to loss=losses.mean_squared_error (from keras import losses), and the loss went back to normal.

@DracoScript

I solved my "loss: nan" problem by fixing my annotations.
I used a conversion script for annotations that changed some bounding boxes sizes to 0 width or height erroneously.

@Sorooshi

The problem can happen for several reasons; mine was because of the second item:

  1. The existence of NaN or null elements in the dataset.
  2. A mismatch between the number of classes and the corresponding labels.

@rbahumi

rbahumi commented Nov 6, 2018

I experienced the same issue and wanted to share that in my case it wasn't one of the features that had the nan/inf value; it was actually an infinite Y value.

Hope that will help someone.

@vijayakumar-govindarajulu

I was facing the same problem. I thought I had my inputs covered and carried out a few suggestions mentioned here, but to no avail. When I inspected my inputs again, I had one value as NaN. When I took care of it, it worked.

@ZhuoyaYang

I also had these problems. I tried everything mentioned above and nothing helped.
But now I seem to have found a solution. I am using a fit-generator and in the Keras Documentation (fit_generator) it mentions that:

...different batches may have different sizes. The last batch of the
epoch is commonly smaller than the others...

I still changed my generator to only output batches of the right size, and voilà, since then I don't get NaN or inf anymore.
Not sure if this helps everybody, but I still wanted to post what helped me.

I am using model.fit_generator() in Keras and ran into this problem: train_loss is normal and decreasing, but val_loss is inf. This is so strange; I checked everything and don't know why.
Finally, I changed the code in my custom data_generator:

def __len__(self):
        #return int(np.ceil(len(self.names) / float(self.batch_size)))
        return int(np.floor(len(self.names) / float(self.batch_size)))

This drops the last batch in the data_generator, and it works!

Thank you so much! The NaN val loss disappears. But my validation set is an integer multiple of the batch size, so why does this help?

@p30arena

p30arena commented Sep 17, 2019

Thanks to @eng-tsmith
In my case, I had to use "fit_generator" instead of "fit",
Then I realized that I was passing only one ground truth for each batch, so I fixed it.

@mdalvi

mdalvi commented Nov 4, 2019

All the discussion here is about NaN in the input data, but I found my culprit in the output data. I fixed it by removing NaNs from the regression target output.

@khums

khums commented Nov 29, 2019

All the discussion here is about NaN in the input data, but I found my culprit in the output data. I fixed it by removing NaNs from the regression target output.

How did you do that inside the Keras loss function?

@mcourteaux
Contributor

I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.
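
A hypothetical guard for that situation: clip the activations before the squaring in a custom activity regularizer so the square cannot overflow to inf (the names and constants are mine, purely illustrative):

import keras.backend as K

def bounded_l2_activity(alpha=1e-4, max_abs=1e6):
    def regularizer(activations):
        clipped = K.clip(activations, -max_abs, max_abs)  # bound the magnitude first
        return alpha * K.sum(K.square(clipped))
    return regularizer

# e.g. Dense(64, activity_regularizer=bounded_l2_activity())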

@khums

khums commented Dec 1, 2019

I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.

I had similar issues with my loss function. Moreover, the eigenvalue decomposition inside TensorFlow (v1.13.1) has the issue that if any singular value is encountered, the loss function still remains NaN even if you filter NaN from the values and perform aggregate operations.

@Sawatdatta

In my case model.predict returns nan values. The model is already trained and I have saved the weights. After compiling the model and loading the weights, the prediction returns nan.

@RCTimms

RCTimms commented Jan 22, 2020

I was sometimes taking the log of a very small number somewhere in my cost function. I added a tiny amount of jitter to stop the output becoming -inf and a NaN being produced at the next update.

Hope this might help someone one day!
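
A minimal illustration of that fix in a hypothetical custom cost with a log term (the loss itself and the epsilon value are placeholders, not the commenter's actual code):

import keras.backend as K

def log_loss_with_jitter(y_true, y_pred):
    eps = 1e-7  # tiny jitter so the log never sees exactly zero
    return -K.mean(y_true * K.log(y_pred + eps))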

@raykipa

raykipa commented Jan 29, 2020

I had the same error for a multiclass problem I was dealing with. At first my output layer had only 1 node and it was giving me a loss of nan, so I changed the output layer to have one node per class and it worked!

@nishuai

nishuai commented Feb 12, 2020

I tried all the solutions mentioned above until I finally figured out that I had added an intermediate backend layer that log-transforms a tensor which can contain 0 values. Hope it helps.

@vikasnataraja

vikasnataraja commented Feb 24, 2020

I tried a lot of alternatives and it looks like the NaN loss can be caused by many different things. In my case, the Keras custom image generator (Keras Sequence) I wrote was the culprit. While generating batches of images, I had initialized the numpy array as np.empty((self.batch_size, self.resize_shape_tuple[0], self.resize_shape_tuple[1], self.num_channels)). I changed that to np.zeros() and it resolved the issue.
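
A sketch of that change inside a hypothetical Sequence __getitem__ (shapes are illustrative): np.zeros guarantees every slot starts initialized, whereas np.empty leaves arbitrary uninitialized memory that silently poisons a batch if any sample fails to be written.

import numpy as np

batch_size, height, width, channels = 8, 224, 224, 3
batch = np.zeros((batch_size, height, width, channels), dtype=np.float32)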


@yasersakkaf

yasersakkaf commented Feb 27, 2020

Thanks @lhatsk, using RMSprop instead of SGD solved my problem

@lambdavar

lambdavar commented Mar 10, 2020

I was getting the loss as nan in the very first epoch, as soon as training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).

I hope this helps someone encountering a similar problem.

This is nice, but dropping a whole row because of a nan value is bad for time-series problems. I found that using df.fillna(0) gets better results!
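
The two approaches side by side on a toy DataFrame (whether 0 is a sensible fill value depends entirely on your features):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [0.5, 0.6, np.nan]})
dropped = df.dropna()   # discards every row that contains a NaN
filled = df.fillna(0)   # keeps every row and replaces NaN with 0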

@Ldoun

Ldoun commented Oct 8, 2020

My problem was that my y (target) values were [1, 2] rather than [0, 1]. This made my loss value negative and eventually produced a nan loss, so check that your y values are correct.
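
A tiny illustration of remapping such labels to be zero-based (assuming integer class labels in a NumPy array; the array contents are made up):

import numpy as np

y = np.array([1, 2, 2, 1])
y_zero_based = y - 1   # now in {0, 1}, as the loss expects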

@SevenUp92

For me the core of the problem was that I used "relu" as the activation function in the LSTM layer.
I replaced "relu" with "tanh" and it worked fine.

@taesookim0412

I had this problem, and my model would predict NaNs on any data, even though my losses were decreasing normally.
This probably means that something was corrupted while processing the last batch. I therefore changed two things: I changed my model from outputting an activation to outputting a sequential layer (in mixed precision), although I don't think this was the cause of the problem, and I used the drop_remainder=True argument in Dataset.batch(). Now it doesn't mysteriously all go NaN after the first epoch. I'm not sure why this even happened, since it worked just fine with other activation functions.
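
A minimal tf.data example of the drop_remainder option mentioned above (the dataset contents here are just placeholders):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
batches = dataset.batch(4, drop_remainder=True)  # the final partial batch is discarded
for b in batches:
    print(b.numpy())   # [0 1 2 3] and [4 5 6 7]; the last two elements are dropped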

@saihtaungkham

In my case, the problem was that the number of neurons in the last output layer didn't match the actual number of labels. This bug gave me quite a headache. :)

@XXZhou25

I wanted to point this out so that it's archived for others who may experience this problem in the future. My loss function was suddenly returning a nan after it got some way into training. I checked the relus, the optimizer, the loss function, my dropout in relation to the relus, and the size and shape of my network. I was still getting a loss that eventually turned into a nan, and I was getting quite frustrated.

Then it dawned on me: I might have some bad input. It turned out that one of the images I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the standard deviation, so I ended up with an exemplar matrix that was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.

Much appreciated! Looks like I have the same problem. Thanks!

@paweller

paweller commented Apr 14, 2021

For everyone out there who might face the nan issue when using the ELBO loss for VAEs, here is what seems to have fixed it for me.

Short version:
A low learning rate of about 1e-4.
Thanks to @unnir, who pointed me in the right direction with his answer.

Long version:
I have tried countless things, including:

  • Making sure that there is no nan in the input data (np.any(np.isnan(data))).
  • Normalizing the input data to the definition domain of sigmoid [0, 1], tanh [-1, 1], or z-score (zero mean and unit variance).
  • Using different optimizers such as Adam or RMSprop.
  • Regularizing the network's weights with kernel regularizers, batch normalization, or a way of normalizing layer weights better suited to VAEs than the well-known batch normalization.
  • Initializing kernels in different ways, namely GlorotUniform, RandomUniform and Zeros.

All of the above did not fix the issue. I finally tried dropping the learning rate from 1e-3 down to 1e-4, which made all the difference. No more nan so far. I can now train with very small to no L2 kernel regularization, and the VAE now also trains with differently normalized training data (sigmoid, tanh, z-score).
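
A minimal sketch of the change that helped here, simply an explicitly lower learning rate (tf.keras style; the loss name is a placeholder for whatever ELBO implementation you use):

from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=1e-4)   # down from the usual 1e-3 default
# model.compile(optimizer=optimizer, loss=elbo_loss)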

Side note:
Here are some resources on VAEs and ELBO that I found quite useful.

@darshankachhadiya

If your file has any missing values, you will get loss = nan,
so make sure that you handle missing values in your data.

@pratyush3124

I know hardly anyone will have made it this far into the thread, but for those of you who still have this problem, here's a tip: DON'T use negative integers in your y data when using sparse_categorical_crossentropy. That solved my problem.

@yysw

yysw commented Nov 9, 2021

In my case, the optimizer was the trick. Using adagrad instead of adam fixed the problem. The problem only happened when I trained the model in a distributed setting with adam; in a single-node setup, adam works fine.

@NomiMalik0207

NomiMalik0207 commented Jan 29, 2022

I am using skeleton (x, y) coordinates from the MPII dataset. Changing relu to tanh worked for me.
Hopefully this will help someone in the future, because I suffered a lot.

@yysw

yysw commented Jan 30, 2022

In my case, the optimizer was the trick. Using adagrad instead of adam fixed the problem. The problem only happened when I trained the model in a distributed setting with adam; in a single-node setup, adam works fine.

Finally, tweaking the epsilon parameter from 1e-8 (the default value) to 1e-6 fixed the problem. This may make gradient descent slower, but the stability of training is more important.
See this Stack Overflow post for details: https://stackoverflow.com/questions/43221065/how-does-the-epsilon-hyperparameter-affect-tf-train-adamoptimizer
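
A sketch of that tweak (tf.keras style; everything except epsilon is left at its default):

from tensorflow.keras.optimizers import Adam

optimizer = Adam(epsilon=1e-6)   # larger epsilon than the default, per the comment above
# model.compile(optimizer=optimizer, loss='mse')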
