NAN loss for regression while training #2134

Closed
indra215 opened this issue Mar 30, 2016 · 57 comments

@indra215

I'm running a regression model on 32x32 patches extracted from images, against a real value as the target. I have 200,000 training samples, but I'm encountering a nan loss during the very first epoch. Can anyone please help me solve this problem? I've tried on both GPU and CPU, but the issue appears either way.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, Activation
from keras.regularizers import l2
from keras.optimizers import SGD

model = Sequential()

model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))

model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))

# single linear output for the regression target
model.add(Dense(1))

sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)

model.fit(X_train, Y_train, batch_size=256, nb_epoch=40)

@NasenSpray

You can't use softmax with only a single output unit.

@indra215
Author

Sorry, it's commented out... there is no softmax layer at the output. I've updated the question.

@jpeg729
Contributor

jpeg729 commented Mar 30, 2016

Have you tried reducing the batch size?

I sometimes get loss: nan with my LSTM networks for time-series regression, and I can nearly always avoid it either by reducing the sizes of my layers, or by reducing the batch size.

@lhatsk

lhatsk commented Mar 31, 2016

Have you tried using e.g., rmsprop instead of sgd? Usually worked better for me with regression.

@the-moliver
Contributor

A few comments. Your l2 regularizers use pretty large terms; try something much smaller to start, e.g. l2(0.001), or get rid of them altogether to see if that helps. You may be driving your weights to 0 too fast. Your dropout rates are also pretty high; people generally don't go above 0.5. Also, for regression problems I find larger batch sizes (around 500) more useful.
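
For anyone who wants a concrete starting point, here is a minimal sketch of those suggestions applied to the model above (same Keras 1.x style API as the original post; the exact values are illustrative, not prescribed by the commenter):

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, Activation
from keras.regularizers import l2
from keras.optimizers import SGD

model = Sequential()
model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(800, W_regularizer=l2(0.001)))  # much weaker l2 than 0.5
model.add(Activation('relu'))
model.add(Dropout(0.5))                         # dropout capped at 0.5
model.add(Dense(1))

sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=512, nb_epoch=40)  # larger batch size (~500)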

@cjnolet

cjnolet commented Dec 21, 2016

I wanted to point this out so that it's archived for others who may experience this problem in the future. My loss function was suddenly returning a nan after it got some way into training. I checked the relus, the optimizer, the loss function, my dropout in relation to the relus, and the size and shape of my network. I was still getting a loss that eventually turned into a nan, and I was getting quite frustrated.

Then it dawned on me: I might have some bad input. It turned out that one of the images I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the standard deviation, so I ended up with an exemplar matrix that was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.
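
A hypothetical normalization helper along these lines, which guards against all-zero (zero-variance) patches so the division cannot produce NaNs (the function name and epsilon are mine, purely to illustrate the fix):

import numpy as np

def normalize_patch(patch, eps=1e-8):
    patch = patch.astype(np.float64)
    std = patch.std()
    if std < eps:  # constant (e.g. all-zero) patch: skip the division entirely
        return patch - patch.mean()
    return (patch - patch.mean()) / std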

@andrewssdd

Sharing my experience for the benefit of others... One thing I found is that the optimizer plays a role in the nan loss issue. Changing from rmsprop to the adam optimizer made this problem go away for me when training an LSTM.

@ylmeng

ylmeng commented May 19, 2017

I recall I had such problems when I used SGD optimizer too, and also rmsprop. Try adam.

@mxh000

mxh000 commented Jun 12, 2017

For future reference: a NaN loss can be caused by any value in your dataset that is not a finite number. In my case, there were some NumPy infinities (np.inf), resulting from a divide by zero in the program that prepares the dataset. Checking for inf or nan in the data first may save you time spent hunting for faults in the model.
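
A quick sanity check of this kind, assuming the features and targets are NumPy arrays named X and y (np.isfinite flags both NaN and inf):

import numpy as np

assert np.all(np.isfinite(X)), "X contains NaN or inf"
assert np.all(np.isfinite(y)), "y contains NaN or inf"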

@naisanza

naisanza commented Aug 15, 2017

@ctawong I'm using relu for activation, categorical_crossentropy for loss, and adam for optimization and I'm getting nan for the loss value

@brittohalloran

brittohalloran commented Oct 7, 2017

Optimizer selection was a major factor in my problem as well (image convolution with unbounded output - gradient explosion with SGD). My experience was that RMSprop with heavy regularization was effective in preventing gradient explosion, but that caused training to converge very slowly (many steps / epochs required).

Adam worked with no dropout / regularization and consequently converged very quickly. Whether skipping dropout / regularization is a good idea (since they help prevent over-fitting) is a separate question, but at least now I can determine the proper amount.

@unnir

unnir commented Jan 4, 2018

My recommendations regarding the issue:

  • Try different optimizers, e.g. sgd, nadam, adam...
  • Scale your data differently, e.g. try the ranges [0, 1] or [-1, 1].
  • In my case the learning rate parameter was the critical one.
  • And the most important thing: always check for NaNs or inf in your dataset. You can do it like this:

df.isnull().any()
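
For completeness, a pandas-style check that covers inf as well as NaN (df is assumed to be the DataFrame holding your data):

import numpy as np

print(df.isnull().any())                            # per-column NaN check
print(np.isinf(df.select_dtypes('number')).any())   # per-column inf check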

@gavriel-merkado

I spent literally hours on this problem, going through every possible suggestion. Then I discovered that one column in my data set had the same numerical value in every row, making it effectively a worthless addition to the DNN. I'd recommend that anyone go right back to their data and not make any assumptions or take anything for granted.

@Mboga

Mboga commented Feb 8, 2018

I have normalized my input data to [0,1] which solved the error where loss = nan

@emoen

emoen commented Feb 28, 2018

Changing from sgd to rmsprop solved the problem for me (linear regression problem)

@eng-tsmith

I also had these problems. I tried everything mentioned above and nothing helped.

But now I seem to have found a solution. I am using a fit generator, and the Keras documentation (fit_generator) mentions that:

...different batches may have different sizes. The last batch of the
epoch is commonly smaller than the others...

I still changed my generator to only output batches of the right size, and voilà, since then I don't get NaN or inf anymore.

Not sure if this helps everybody, but I still wanted to post what helped me.

@claycoleman

I tried every suggestion on this page and many others to no avail. We were importing CSV files with pandas, then using the Keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing that some CSV files led to nan while others worked, we suddenly looked at the encoding of the files and realized that ASCII files were NOT working with Keras, leading to nan loss and an accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.

If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. I haven't tried the latter, but I'd imagine it would work as well. Hopefully this helps someone very, very frustrated!
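
If the file does turn out to be ISO-8859-1, a re-encoding step like the following is one option (file names are placeholders, and it assumes the source really is Latin-1):

with open('input.csv', 'r', encoding='iso-8859-1') as src:
    text = src.read()
with open('input_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)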

@AloshkaD

I had the loss = nan issue and solved it by making sure the number of classes in the config matched my dataset. The default num_classes was 92+1.

@alyato

alyato commented Apr 19, 2018

Hi guys, I've run into a weird problem.

Training: 10,000 images
Validation: 2,000 images
nb_classes: 8

example 1.

base_model = densenet121(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(8, activation='sigmoid')(x)
model = Model(input=base_model.input, output=predictions)
for layer in base_model.layers:
    layer.trainable = False
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, validation_data=(X_val, Y_val))

When I run this code, the train loss and val loss are NaN.
Then I changed the network.
example 2.

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),padding='same',input_shape=(channels,img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('sigmoid'))
model.compile(loss=multitask_loss,optimizer='adam',metrics=['accuracy'])
model.fit(X_train, Y_train,batch_size=batch_size,epochs=epochs,verbose=1 ,validation_data=(X_val, Y_val))

But when I run this code, the train loss and val loss are normal.
So I think the problem is in the network while fine-tuning it.
I want to use the pre-trained DenseNet; how can I solve the NaN loss?
Thanks.

@Krithi07

I was getting the loss as nan in the very first epoch, as soon as training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).

I hope this helps someone encountering a similar problem.

@ghost

ghost commented Jul 30, 2018

Hi,

I put model.add(BatchNormalization()) after the conv layers and it works for me.
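
A minimal sketch of that arrangement (Keras 2 style; layer sizes and input shape are only illustrative, and whether to place BatchNormalization before or after the activation is a separate debate):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 1)))
model.add(BatchNormalization())   # normalize the conv outputs
model.add(Activation('relu'))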

@yosunpeng

In my case, it was the loss function. I used loss='sparse_categorical_crossentropy' and switched to loss=losses.mean_squared_error (from keras import losses), and the loss went back to normal.

@DracoScript

I solved my "loss: nan" problem by fixing my annotations.
I used a conversion script for annotations that changed some bounding boxes sizes to 0 width or height erroneously.

@Sorooshi

The problem can happen for several reasons; mine was because of the second item:

  1. The existence of NaN or null elements in the dataset.
  2. A mismatch between the number of classes and the corresponding labels.

@rbahumi

rbahumi commented Nov 6, 2018

I experienced the same issue and wanted to share that in my case it wasn't one of the features that had the nan/inf value; it was actually an infinite Y value.

Hope that will help someone.

@vijayakumar-govindarajulu

I was facing the same problem. I thought I had my inputs covered and carried out a few suggestions mentioned here, but to no avail. When I inspected my inputs again, I had one value as NaN. When I took care of it, it worked.

@ZhuoyaYang

I also had these problems. I tried everything mentioned above and nothing helped.
But now I seem to have found a solution. I am using a fit-generator and in the Keras Documentation (fit_generator) it mentions that:

...different batches may have different sizes. The last batch of the
epoch is commonly smaller than the others...

I still changed my generator to only output batches of the right size, and voilà, since then I don't get NaN or inf anymore.
Not sure if this helps everybody, but I still wanted to post what helped me.

I am using model.fit_generator() in Keras and ran into this problem: train_loss is normal and decreasing, but val_loss is inf. This is so strange; I checked everything and don't know why.
Finally, I changed the code in my custom data_generator:

def __len__(self):
        #return int(np.ceil(len(self.names) / float(self.batch_size)))
        return int(np.floor(len(self.names) / float(self.batch_size)))

This drops the last batch in the data_generator, and it works!

Thank you so much! The NaN val loss disappears. But my validation set is an integer multiple of the batch size, so why does this help?

@p30arena

p30arena commented Sep 17, 2019

Thanks to @eng-tsmith
In my case, I had to use "fit_generator" instead of "fit",
Then I realized that I was passing only one ground truth for each batch, so I fixed it.

@mdalvi

mdalvi commented Nov 4, 2019

All the discussion here is about NaN in the input data, but I found my culprit in the output data. I fixed it by removing NaNs from the regression target output.

@khums

khums commented Nov 29, 2019

All the discussion here is about NaN in the input data, but I found my culprit in the output data. I fixed it by removing NaNs from the regression target output.

How did you do that inside the Keras loss function?

@mcourteaux
Contributor

I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.
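
A hypothetical guard for that situation: clip the activations before the squaring in a custom activity regularizer so the square cannot overflow to inf (the names and constants are mine, purely illustrative):

import keras.backend as K

def bounded_l2_activity(alpha=1e-4, max_abs=1e6):
    def regularizer(activations):
        clipped = K.clip(activations, -max_abs, max_abs)  # bound the magnitude first
        return alpha * K.sum(K.square(clipped))
    return regularizer

# e.g. Dense(64, activity_regularizer=bounded_l2_activity())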

@khums

khums commented Dec 1, 2019

I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.

I had similar issues with my loss function. Moreover, the eigenvalue decomposition inside TensorFlow (v1.13.1) has the issue that if any singular value is encountered, the loss function still remains NaN even if you filter NaN from the values and perform aggregate operations.

@Sawatdatta

In my case model.predict returns nan values. The model is already trained and I have saved the weights. After compiling the model and loading the weights, the prediction returns nan.

@RCTimms

RCTimms commented Jan 22, 2020

I was sometimes taking the log of a very small number somewhere in my cost function. I added a tiny amount of jitter to stop the output becoming -inf and a NaN being produced at the next update.

Hope this might help someone one day!
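
A minimal illustration of that fix in a hypothetical custom cost with a log term (the loss itself and the epsilon value are placeholders, not the commenter's actual code):

import keras.backend as K

def log_loss_with_jitter(y_true, y_pred):
    eps = 1e-7  # tiny jitter so the log never sees exactly zero
    return -K.mean(y_true * K.log(y_pred + eps))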

@raykipa

raykipa commented Jan 29, 2020

I had the same error for a multiclass problem I was dealing with. At first my output layer had only 1 node and it was giving me a loss of nan, so I changed the output layer to have one node per class and it worked!

@nishuai

nishuai commented Feb 12, 2020

I tried all the solutions mentioned above until I finally figured out that I had added an intermediate backend layer that log-transforms a tensor which can contain 0 values. Hope it helps.

@vikasnataraja

vikasnataraja commented Feb 24, 2020

I tried a lot of alternatives and it looks like the NaN loss can be caused by many different things. In my case, the Keras custom image generator (Keras Sequence) I wrote was the culprit. While generating batches of images, I had initialized the numpy array as np.empty((self.batch_size, self.resize_shape_tuple[0], self.resize_shape_tuple[1], self.num_channels)). I changed that to np.zeros() and it resolved the issue.
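
A sketch of that change inside a hypothetical Sequence __getitem__ (shapes are illustrative): np.zeros guarantees every slot starts initialized, whereas np.empty leaves arbitrary uninitialized memory that silently poisons a batch if any sample fails to be written.

import numpy as np

batch_size, height, width, channels = 8, 224, 224, 3
batch = np.zeros((batch_size, height, width, channels), dtype=np.float32)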


@yasersakkaf

yasersakkaf commented Feb 27, 2020

Thanks @lhatsk, using RMSprop instead of SGD solved my problem

@lambdavar

lambdavar commented Mar 10, 2020

I was getting the loss as nan in the very first epoch, as soon as training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).

I hope this helps someone encountering a similar problem.

This is nice, but dropping a whole row because of a nan value is bad for time-series problems. I found that using df.fillna(0) gets better results!
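
The two approaches side by side on a toy DataFrame (whether 0 is a sensible fill value depends entirely on your features):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [0.5, 0.6, np.nan]})
dropped = df.dropna()   # discards every row that contains a NaN
filled = df.fillna(0)   # keeps every row and replaces NaN with 0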

@Ldoun

Ldoun commented Oct 8, 2020

My problem was that my y (target) values were [1, 2] rather than [0, 1]. This made my loss value negative and eventually produced a nan loss, so check that your y values are correct.
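
A tiny illustration of remapping such labels to be zero-based (assuming integer class labels in a NumPy array; the array contents are made up):

import numpy as np

y = np.array([1, 2, 2, 1])
y_zero_based = y - 1   # now in {0, 1}, as the loss expects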

@SevenUp92

For me the core of the problem was that I used "relu" as the activation function in the LSTM layer.
I replaced "relu" with "tanh" and it worked fine.

@taesookim0412

I had this problem, and my model would predict NaNs on any data, even though my losses were decreasing normally.
This probably means that something was corrupted while processing the last batch. I therefore changed two things: I changed my model from outputting an activation to outputting a sequential layer (in mixed precision), although I don't think this was the cause of the problem, and I used the drop_remainder=True argument in Dataset.batch(). Now it doesn't mysteriously all go NaN after the first epoch. I'm not sure why this even happened, since it worked just fine with other activation functions.
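
A minimal tf.data example of the drop_remainder option mentioned above (the dataset contents here are just placeholders):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
batches = dataset.batch(4, drop_remainder=True)  # the final partial batch is discarded
for b in batches:
    print(b.numpy())   # [0 1 2 3] and [4 5 6 7]; the last two elements are dropped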

@saihtaungkham

In my case, the problem was that the number of neurons in the last output layer didn't match the actual number of labels. This bug gave me quite a headache. :)

@XXZhou25

I wanted to point this out so that it's archived for others who may experience this problem in the future. My loss function was suddenly returning a nan after it got some way into training. I checked the relus, the optimizer, the loss function, my dropout in relation to the relus, and the size and shape of my network. I was still getting a loss that eventually turned into a nan, and I was getting quite frustrated.

Then it dawned on me: I might have some bad input. It turned out that one of the images I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the standard deviation, so I ended up with an exemplar matrix that was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.

Much appreciated! Looks like I have the same problem. Thanks!

@paweller

paweller commented Apr 14, 2021

For everyone out there who might face the nan issue when using the ELBO loss for VAEs, here is what seems to have fixed it for me.

Short version:
A low learning rate of about 1e-4.
Thanks to @unnir, who pointed me in the right direction with his answer.

Long version:
I have tried countless things, including:

  • Making sure that there is no nan in the input data (np.any(np.isnan(data))).
  • Normalizing the input data to the definition domain of sigmoid [0, 1], tanh [-1, 1], or z-score (zero mean and unit variance).
  • Using different optimizers such as Adam or RMSprop.
  • Regularizing the network's weights with kernel regularizers, batch normalization, or a way of normalizing layer weights better suited to VAEs than the well-known batch normalization.
  • Initializing kernels in different ways, namely GlorotUniform, RandomUniform and Zeros.

All of the above did not fix the issue. I finally tried dropping the learning rate from 1e-3 down to 1e-4, which made all the difference. No more nan so far. I can now train with very small to no L2 kernel regularization, and the VAE now also trains with differently normalized training data (sigmoid, tanh, z-score).
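
A minimal sketch of the change that helped here, simply an explicitly lower learning rate (tf.keras style; the loss name is a placeholder for whatever ELBO implementation you use):

from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=1e-4)   # down from the usual 1e-3 default
# model.compile(optimizer=optimizer, loss=elbo_loss)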

Side note:
Here are some resources on VAEs and ELBO that I found quite useful.

@darshankachhadiya

If your file has any missing values, you will get loss = nan,
so make sure that you handle missing values in your data.

@pratyush3124

I know hardly anyone will have made it this far into the thread, but for those of you who still have this problem, here's a tip: DON'T use negative integers in your y data when using sparse_categorical_crossentropy. That solved my problem.

@yysw

yysw commented Nov 9, 2021

In my case, the optimizer was the trick. Using adagrad instead of adam fixed the problem. The problem only happened when I trained the model in a distributed setting with adam; in a single-node setup, adam works fine.

@NomiMalik0207

NomiMalik0207 commented Jan 29, 2022

I am using skeleton (x, y) coordinates from the MPII dataset. Changing relu to tanh worked for me.
Hopefully this will help someone in the future, because I suffered a lot.

@yysw

yysw commented Jan 30, 2022

In my case, the optimizer was the trick. Using adagrad instead of adam fixed the problem. The problem only happened when I trained the model in a distributed setting with adam; in a single-node setup, adam works fine.

Finally, tweaking the epsilon parameter from 1e-8 (the default value) to 1e-6 fixed the problem. This may make gradient descent slower, but the stability of training is more important.
See this Stack Overflow post for details: https://stackoverflow.com/questions/43221065/how-does-the-epsilon-hyperparameter-affect-tf-train-adamoptimizer
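
A sketch of that tweak (tf.keras style; everything except epsilon is left at its default):

from tensorflow.keras.optimizers import Adam

optimizer = Adam(epsilon=1e-6)   # larger epsilon than the default, per the comment above
# model.compile(optimizer=optimizer, loss='mse')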
