Loss turns into 'nan' when running on GPU #1244

Open
lmoesch opened this Issue Dec 11, 2015 · 62 comments

lmoesch commented Dec 11, 2015

Like previously stated in issue #511, Keras runs into not-a-number (NaN) losses while training on the GPU.
I tested this with the mnist_cnn example code as well as with self-designed conv networks. I also tried disabling cuDNN, increasing the epsilon, and setting a clipnorm. Nothing solved the problem.

I'm using the latest versions of Theano and Keras, with SGD optimization and categorical crossentropy.

Graphics: GTX 980 Ti

Collaborator

fchollet commented Dec 11, 2015

I'd like to identify what op is causing this issue.

  • Post your code here.
  • Try a different loss than categorical crossentropy, e.g. MSE (see the sketch below).
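A minimal sketch of that swap (reusing the SGD setup and the model from the code posted below in this thread; an illustration, not code from the thread): only the loss argument changes.

from keras.optimizers import SGD

# Same optimizer as before; only the loss string changes.
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mse', optimizer=sgd)  # instead of loss='categorical_crossentropy'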
lmoesch commented Dec 11, 2015

Here is the net part of my code. I'll try other loss functions, but they take some time to provide useful evidence, since you can't predict when the loss will turn into 'nan'.

img_rows = img_cols = 128
img_channels = 3
l1 = l2 = 0

# convert data for GPU use
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
X_train /= 255
X_test /= 255

# convert class vectors to binary class matrices
y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

model.add(Convolution2D(16, 5, 5, border_mode='same',
                        input_shape=(img_channels, img_rows, img_cols), W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(16, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Convolution2D(32, 3, 3, border_mode='valid', W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Convolution2D(64, 3, 3, border_mode='valid', W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Flatten())
model.add(Dense(1024, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Dropout(0.6))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch, shuffle=True, show_accuracy=True, callbacks=[history])
hr0nix commented Dec 12, 2015

As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.
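A minimal sketch of that change against the model posted above (an assumption about where it applies, not hr0nix's code): only the activation after the last hidden Dense layer changes.

from keras.layers.core import Dense, Activation, Flatten

# model and nb_classes as in the snippet above
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('tanh'))   # was Activation('relu')
model.add(Dense(nb_classes))
model.add(Activation('softmax'))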

lmoesch commented Dec 13, 2015

I first tested the 'tanh' activation, which didn't help. That was no surprise though, since this problem is specific to calculations on the GPU and is not a general numerical-stability issue.

I also tried MSE as the loss function, which ran into 'nan' as well.

Collaborator

fchollet commented Dec 14, 2015

I also tried MSE as the loss function, which ran into 'nan' as well.

In that case the overflow is happening earlier in the graph.

Next, you could try removing the regularizers.

BTW the history callback is included by default, no need to specify it manually.
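A minimal sketch of both suggestions, reusing the imports and variables from the snippet above (an illustration, not prescribed code): the layers drop their W_regularizer arguments, and fit() is called without an explicit callback, since the History object is attached by default.

model.add(Convolution2D(16, 5, 5, border_mode='same',
                        input_shape=(img_channels, img_rows, img_cols)))  # no W_regularizer
model.add(Activation('relu'))
# ... remaining layers as before, minus their W_regularizer arguments ...
history = model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
                    shuffle=True, show_accuracy=True)  # no callbacks argument needed
# history.history holds the recorded losses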

lmoesch commented Dec 14, 2015

Correct me if I'm wrong, but with l1 = l2 = 0 it should not matter that the l1l2 regularizer is defined?

But I'll try removing them.

Collaborator

fchollet commented Dec 14, 2015

Of course, it should not matter. Also, there should not be a float32 overflow.

lmoesch commented Dec 21, 2015

Okay, I removed all W-regularizers and the 'nan' loss still occurs.

I noticed that the net is more likely to output a 'nan' loss with deeper (e.g. 8 conv layers instead of 6) and wider (e.g. 512 feature maps) networks.

felipefariax commented Dec 24, 2015

I am also having a problem with nans in the loss.

I noticed that the weights became nan, but I don't know whether this happened before or after the loss calculation. Which is very strange, since the values used to calculate the crossentropy are clipped before the objective function is applied...

qqgeogor commented Dec 25, 2015

The same thing happened to me, except I was using Keras to build a regression model. I have tried different losses (rmse and mae) and also sigmoid and tanh apart from relu. Nothing helped to improve this case.


felipefariax commented Dec 28, 2015

I think I've fixed it in PR #1368.
@fchollet what do you think?

lmoesch commented Jan 6, 2016

I get your point about preventing division by zero, but this doesn't explain why this problem is specific to certain GPUs, especially the GTX 9xx series (I never had a problem on my GTX 670).

tylerklement commented Jul 7, 2016

I'm getting the same problem. I think this is a problem with the system configuration more so than with the code. My code used to work, but then I had to reformat my computer and reinstall everything, and now I'm getting "nan" loss. So I think it's something with the configuration of Theano, CUDA, Visual Studio, or CuDNN, at least in my case. Still trying to figure it out.

azzever commented Jul 7, 2016

I'm also getting this problem (Ubuntu 14.04, GTX 980Ti/970, Theano as backend, CNN with residual units, ReLU, BN, mse/mae loss).

In my case the problem occurred randomly; the probability of getting nan increases with the model's complexity (and memory usage). Once the loss becomes nan, loading saved weights doesn't help to continue training (the weights become corrupted on the first training iteration). Only recompiling or creating a new model allows training to continue.

tylerklement commented Jul 7, 2016

It works for me now. I had installed cuDNN incorrectly - previously I had just dragged the cuDNN files and dropped them in the CUDA folder, replacing anything with the same name. So I re-installed Visual Studio (2013), Anaconda, Theano, and Keras. It still gave me "nan". So then, I installed cuDNN, but this time, I did this by extracting the cuDNN files to their own directory, and then just added that directory to my path. I think that was the key factor for me: installing cuDNN (properly). I was using relu and adam the whole time.

djstrong commented Aug 14, 2016

The same problem on a Tesla M2090. I tried consume_less set to both 'gpu' and 'cpu'. GRU works OK.

WenchenLi commented Aug 18, 2016

Anybody any progress on this issue?

9thDimension commented Aug 18, 2016

I had this problem - nets that worked perfectly fine on various CPU hardware failed to train on AWS GPU-enabled remote machine.

I removed Theano 0.8.0, and upgraded to the bleeding-edge version from GitHub (which is 0.9.0-dev2). Now training works perfectly.

Can't blame this one on Keras, folks!

djstrong commented Aug 18, 2016

On the CPU I am getting nans too, but after more epochs than on the GPU.

ersinyar commented Sep 19, 2016

I have the same problem. I train an LSTM network with my own data, and the train loss suddenly becomes NaN. I checked my code with the imdb dataset and it works OK, but when I switch to my dataset the nan problem occurs. I preprocessed my data the same way the imdb dataset is preprocessed in Keras's imdb_lstm example. I don't understand what the problem is: the network configuration seems fine since it runs with another dataset, and both my dataset and imdb are text, so how can another text dataset cause this issue? I tried gradient clipping and also weight-norm limits. I think the sudden change happens when an inf value is produced inside the categorical crossentropy, such as log(0). But how can I detect and avoid this problem?
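One way to pin down the exact batch where the loss blows up is a small custom callback that halts training on the first non-finite loss. This is a sketch against the Keras 1.x callback API (not code from this thread; the class name NanMonitor is made up):

import numpy as np
from keras.callbacks import Callback

class NanMonitor(Callback):
    # Stop training as soon as a batch loss comes back NaN or inf.
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and not np.isfinite(loss):
            print('Non-finite loss %s at batch %d' % (loss, batch))
            self.model.stop_training = True

# model.fit(X_train, y_train, callbacks=[NanMonitor()])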

eyal-str commented Sep 26, 2016

I also had this problem. I fixed it when I changed the Y values to float numbers, e.g. 0.0 and 1.0 instead of 0 and 1.
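For reference, a sketch of that change (an assumption about the surrounding code, not eyal-str's exact script): cast the integer labels to float32 before training.

import numpy as np

y_train = np.asarray(y_train).astype('float32')  # 0/1 ints become 0.0/1.0 floats
y_test = np.asarray(y_test).astype('float32')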

pendragvn commented Oct 1, 2016

Like @9thDimension said, upgrading Theano to the bleeding-edge version (0.9.0-dev2) seems to have fixed the nan issues for me so far on Debian Wheezy. I'm using a Python 3.5.2 env in Anaconda 4.1.1.

I just followed the instructions from the Theano website:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

skerit commented Oct 7, 2016

I'm training on the CPU and using TensorFlow as the backend, and I'm also getting the nan issue.

svolchkov commented Oct 8, 2016

Was having the same issue with a regression task. Upgrading Theano didn't work, but changing the optimizer from 'sgd' to 'rmsprop' seemed to help.
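A minimal sketch of that swap (assuming a model already built for the regression task; an illustration, not svolchkov's code): only the optimizer argument changes.

model.compile(loss='mse', optimizer='rmsprop')  # was optimizer='sgd'

# or, with an explicit learning rate:
from keras.optimizers import RMSprop
model.compile(loss='mse', optimizer=RMSprop(lr=0.001))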

patrick-ogrady commented Nov 3, 2016

@skerit did you figure it out?

patricio-astudillo commented Nov 23, 2016

I had the nan problem as well, and I solved it by changing the floatx value in ~/.keras/keras.json from float32 to float64 (tested on GPU).

This is the description of my setup:
Backend: tensorflow and theano
Optimizer: Adam
GPU: Titan X and GTX 970
Activations: RELU
Last layer activation: sigmoid
Objective: binary cross entropy

If more details are needed, let me know.

EDIT: the problem was not actually solved by this, but training lasted longer, so an acceptable loss value was reached.
EDIT 2: after re-reading the images and saving them, training lasted even longer.
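For reference, a sketch of making the same change from Python instead of editing ~/.keras/keras.json by hand (K.set_floatx is a Keras backend call; as the edits above note, this is not guaranteed to fix the nans):

from keras import backend as K

K.set_floatx('float64')   # equivalent to "floatx": "float64" in ~/.keras/keras.json
print(K.floatx())         # 'float64'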

Contributor

nouiz commented Nov 23, 2016

Did you time it? If you use the current GPU backend, this causes all computation to be done on the CPU.


Member

farizrahman4u commented Nov 24, 2016

I was having the same issue. I disabled cuDNN (optimizer_exclude=cudnn) and everything works fine. And slow.
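For the Theano backend, one way to do that is through THEANO_FLAGS set before Theano is imported. This is a sketch that assumes the flag is spelled optimizer_excluding=cudnn (the comment above writes optimizer_exclude); check the Theano docs for your version.

import os

# Must be set before theano/keras are imported.
os.environ['THEANO_FLAGS'] = 'optimizer_excluding=cudnn'

import keras  # Theano now loads with its cuDNN graph optimizations excluded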

Contributor

jphalip commented Dec 10, 2016

I too ran into a similar issue where the loss and layer weights would suddenly be set to nan during training with floatx as float32 (it worked fine with float64 but was much slower).

I was able to fix this by applying either the clipnorm or clipvalue optimizer attributes (https://keras.io/optimizers/#parameters-common-to-all-keras-optimizers). It seems that for me this was a case of exploding gradients, which may not be true for all cases reported here. I just thought I'd mention what worked for me in case that's helpful to others.
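A minimal sketch of the two options (assuming an SGD optimizer like the one earlier in the thread and a model that is already defined):

from keras.optimizers import SGD

sgd = SGD(lr=0.01, momentum=0.9, clipnorm=1.0)     # clip the global gradient norm
# sgd = SGD(lr=0.01, momentum=0.9, clipvalue=0.5)  # or clip each gradient element-wise
model.compile(loss='categorical_crossentropy', optimizer=sgd)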

svolchkov commented Dec 12, 2016

I used clipnorm too, and it allowed me to use the Adam optimizer. I wonder if using clipnorm might have a negative impact on accuracy.

ghost commented Jan 22, 2017

I know that clipnorm fixes this issue, and I know that clipnorm clips large gradient values, but I want to know why the nan is produced in the first place.
Why do I see loss=nan when I don't clip the gradients?

sergsb commented Jan 25, 2017

I have the same issue while training a 3D convolutional network on the GPU. I use float32 and the Theano backend.

monajalal commented Jan 31, 2017

I have the following in ~/.keras/keras.json:

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf",
    "backend": "tensorflow"
}

and I got nan:

mona@pascal:~/computer_vision/VPilot$ python train.py 
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:1938: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
  warnings.warn('\n'.join(msg))
Epoch 1/1000
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x4750d80
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:83:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40c, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 4777 get requests, put_count=3270 evicted_count=1000 eviction_rate=0.30581 and unsatisfied allocation rate=0.54574
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110
    4/70629 [..............................] - ETA: 364851s - loss: 0.5890I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 755 get requests, put_count=1771 evicted_count=1000 eviction_rate=0.564653 and unsatisfied allocation rate=0
    8/70629 [..............................] - ETA: 194931s - loss: 0.5553I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 247 get requests, put_count=1270 evicted_count=1000 eviction_rate=0.787402 and unsatisfied allocation rate=0
   13/70629 [..............................] - ETA: 129454s - loss: 0.5582I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5071 get requests, put_count=4961 evicted_count=2000 eviction_rate=0.403145 and unsatisfied allocation rate=0.423979
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 449 to 493
   18/70629 [..............................] - ETA: 100341s - loss: 0.5194I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5145 get requests, put_count=5327 evicted_count=2000 eviction_rate=0.375446 and unsatisfied allocation rate=0.365986
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 720 to 792
   25/70629 [..............................] - ETA: 79355s - loss: 0.5875I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5137 get requests, put_count=5388 evicted_count=1000 eviction_rate=0.185598 and unsatisfied allocation rate=0.175784
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 1694 to 1863
70629/70629 [==============================] - 25358s - loss: nan - val_loss: nan
Epoch 2/1000
70629/70629 [==============================] - 24899s - loss: nan - val_loss: nan
Epoch 3/1000
70629/70629 [==============================] - 24967s - loss: nan - val_loss: nan
Epoch 4/1000
70629/70629 [==============================] - 24987s - loss: nan - val_loss: nan
Epoch 5/1000
70629/70629 [==============================] - 24855s - loss: nan - val_loss: nan
Epoch 6/1000
70629/70629 [==============================] - 24977s - loss: nan - val_loss: nan


univ12 commented Feb 6, 2017

I have this problem on Keras 1.2.1 and Theano 0.9.0b1. My epochs already start with nan. Adding clipvalue=1, changing the learning rate, and trying different optimizers did not help.

Contributor

vwrs commented Feb 14, 2017

I also have the same issue training an LSTM network with multi_gpu.py, using mse as the loss function.

vvpreetham commented Feb 15, 2017

I get NaN for a linear regressor at model.evaluate time with Adam or a TF-backend FTRL optimizer. I have tried changing the parameter size of the NN architecture and played around with learning rates, regularizers, clipping, etc. No luck. I am running on 3 Tesla-X GPUs.

(BTW, this happens only when I allocate more than 1 GPU.)

jmaronas commented Mar 2, 2017

I will post my experience and my solution. One of the key things about saturation really has nothing to do with the cost or the parameters, but with the updates. The softmax function has some tricks for preventing overflow when the previous layer is a ReLU that can output high values. I'm fairly sure Theano implements these tricks, but I have not checked.

So normally deactivating cuDNN can solve the problem. I experienced this problem on a classification convolutional neural network and on an MSE fully-connected neural network, with good parameter initialization and data normalization (I did not initialize the weights with values on the order of 10³, for example). First deactivate cuDNN, as it makes approximations.

Then, with the same code, I saw saturation depend on the Theano version: in one Theano version it does not saturate, and in another it does. Moreover, depending on the GPU I also see saturation: on a GTX 1070 I get more saturation than on a GTX 1080. Hopefully the new Theano back-end will bring float64 support, but for the moment that does not seem to be happening.

So finally, the way I solved this is by scaling the cost function. Saturation sometimes happens because in an early layer the derivative with respect to a weight is a sum over the mini-batch (and the earlier the layer, the more summations contribute). Lots of sums can produce high values that end up making a saturated update. Since scaling is a monotonic transformation, it does not change the optimization point. Simply take your cost and scale it by, for example, 0.00001. This solved my problem (see the sketch below).

Note that my sum of squared errors was normalized by the batch size and by my factor. Hope this helps.
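A minimal sketch of that scaling trick as a custom Keras loss (an illustration, not jmaronas's code; the 1e-5 factor is the example value mentioned above):

from keras import backend as K

def scaled_mse(y_true, y_pred):
    # Scaling the cost by a small constant shrinks every gradient by the same
    # factor without moving the optimum, which keeps the updates small.
    return 1e-5 * K.mean(K.square(y_pred - y_true), axis=-1)

# model.compile(loss=scaled_mse, optimizer='sgd')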

vvpreetham commented Mar 6, 2017

An update on the bug, (I run this on Tesla-X GPU).

I do consistently get the error when I use sample_weights. The model has a sparse input size of about 8000 neurons and the first layer is an SQRT reduction of the size.

with tf.device('/cpu:0'):
    width = wide_array_width(wide_col_len_dict)
    reduction = wide_reduce(width)
    model = Sequential()
    model.add(Dense(reduction, input_dim=width, activation='softplus'))
    if(middle_layer):
        model.add(Dense(wide_reduce(reduction), activation='softplus', W_constraint=maxnorm(2)))
    # final_layer
    model.add(Dense(1, init='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=0.001,
                                                  beta_1=0.9,
                                                  beta_2=0.999,
                                                  epsilon=10e-04,
                                                  decay=0.0,
                                                  clipnorm=1.0,
                                                  clipvalue=0.3))

The model trains if I comment out the sample_weight line (but then the model trains horribly wrong):

    hist = model.fit(input_dense_matrix, 
                     labels, 
                     nb_epoch=train_steps, 
                     verbose=0, 
                     shuffle=True,
                     validation_split=0.2,
                     batch_size=60,
                     sample_weight=sample_weights_,
                     callbacks=[early_stopping, checkpointer])
silentsnooc commented Mar 25, 2017

I am having the same issue for this network:

class FeedForward:

    def __init__(self, input_dim, nb_classes):

        in_x = Input(shape=(input_dim, ), name='in_x')
        h1 = Dense(14, name='h1', activation='tanh')(in_x)
        h2 = Dense(8, name='h2', activation='tanh')(h1)
        out = Dense(nb_classes, name='out', activation='tanh')(h2)

        self.model = Model(input=[in_x], output=[out])

    def compile_model(self, optimizer='adam', loss='mse'):
        self.model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

The loss will always be nan unless I wrap everything in with tf.device('/cpu:0'): and run the calculations on the CPU.
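For reference, a sketch of that workaround (the input_dim and nb_classes values are made up): build and compile the model inside a CPU device scope so none of its ops land on the GPU.

import tensorflow as tf

with tf.device('/cpu:0'):
    net = FeedForward(input_dim=20, nb_classes=3)  # hypothetical sizes
    net.compile_model()
    # net.model.fit(...) can also go inside this scope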

MClarkTurner commented Mar 30, 2017

I'm having a similar issue with my new Titan X running TF 1.0.1 with CUDA 8.0 and cuDNN 5.1.10. I have tried clipping the gradients, but with no luck. My model works fine on CPU, but within 100 iterations of mini-batches of size 10 I inevitably get NaN values when running on my GPU.

Is it possible this is a problem with my installation of CUDA, cuDNN, or TF? I've tried building TF from source to no avail. Has anyone had any luck going back to CUDA 7.5 and cuDNN 4?

EDIT: After a lot of work I found out that this was an error in my code, not in the architecture. Apparently nans can become more prevalent depending on your environment, but at its core this seems to be an issue with my model.

rohankshir commented Apr 2, 2017

I am also getting the issue when I add regularization (for an attention layer). I played with kernel regularization and activity regularization, and both result in nans. I get nan in both GPU and CPU training.

GZuin commented Apr 14, 2017

Also having this issue on a GeForce GTX 1060. Training on CPU works OK; on GPU the loss becomes nan after the first batch update.

Tried multiple versions of cuDNN (all of them 5005 or more recent), Theano (0.8.something and 0.9.0) and Keras (1.2.something and 2.something). All had the same problem.

Tried disabling cuDNN through .theanorc and scaling my loss by 0.00001. Neither solved the issue (although I was still seeing cudnn 5005 in the 'Using gpu' message...).

Things worth noting:
- the loss becomes nan even if the learning rate is set to 0.
- my model is Input-Embedding-CNN-CNN-Dense-Output. If I remove one CNN layer, training works again.

Tried running the same program on a Tesla K40c. Same story.

Tried decreasing the batch_size. So far so good, I haven't seen the nan error yet. My batch size before was 250: it made the first loss calculation (0.0853) and then turned to nan at 500/80000. Now I'm using a batch size of 2 (went to the extreme) and am currently at 1000/80000 without any problems. I will try different batch sizes and find the one that works best for me.

Again, 500 and 1000 are only the chunks of data processed; this is all within the first epoch.

Hope this might help people in the future with the same problem I had.
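For reference, the only change described above is the batch_size passed to fit (a sketch against the Keras 1.x API; the other arguments are whatever the model already used):

model.fit(X_train, y_train, batch_size=2, nb_epoch=nb_epoch)  # was batch_size=250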

Phylliida commented May 15, 2017

I kept having this issue, which was annoying because I would train something overnight and in the morning it was nan. I think I fixed it now; I haven't gotten nans after about a day of training, but I'll update my comment if I do.

To fix this, you have to do three things:

Add a very minor bias and weight regularizer to every layer

model.add(Dense(hiddenSize, kernel_regularizer=l2(0.00001), bias_regularizer=l2(0.00001)))

This is so small it won't really affect your training, it will just ensure the weights and biases don't get massive

Next I did

optimizer = optimizers.Adam(clipnorm=1., clipvalue=0.5)

As described above. Finally, I am using crossentropy loss so I changed it to this:

from keras import losses
import theano.tensor as T  # T.clip is used below

def constrainedCrossEntropy(ytrue, ypred):
  ypred = T.clip(ypred, 0.0001, 0.99999)
  return losses.categorical_crossentropy(ytrue, ypred)

model.compile(loss=constrainedCrossEntropy, optimizer=optimizer)

Which ensures the values stay in a reasonable range, because if they get too close to 0 or 1 you will get nans

Edit: I had the parameters flipped for my constrainedCrossEntropy function, fixed that now

kielnino commented May 17, 2017

I also had problems with the train or validation loss turning to nan, until I realized that my custom loss function was not capable of handling values bigger than 88 (because exp(89) is too big for float32).

from keras import backend as K

def binary_regression_error(y_true, y_pred):
    return K.mean(K.log(1 + K.exp(K.clip(-y_true*y_pred, -1e40, 88.))))

So clipping solved it for me.

dupsys commented May 23, 2017

Hi guys,
I don't know what to do anymore. I have tried all the solutions given above, but I still get loss: nan and acc: nan with a very small batch size of 50. I am using a GeForce GTX 680 with cuDNN version 5105. Below is the output with the TensorFlow backend:
35/50 [====================>.........] - ETA: 13s - loss: 9.4006 - acc: 0.1041{'acc': 0.10833333, 'loss': 9.4530907, 'batch': 35, 'size': 32}
36/50 [====================>.........] - ETA: 12s - loss: 9.4020 - acc: 0.1043
i:96
{'acc': 0.110625, 'loss': 9.3898754, 'batch': 36, 'size': 32}
37/50 [=====================>........] - ETA: 11s - loss: 9.4017 - acc: 0.1044{'acc': 0.10916667, 'loss': 9.2677832, 'batch': 37, 'size': 32}
38/50 [=====================>........] - ETA: 10s - loss: 9.3982 - acc: 0.1046{'acc': 0.11254902, 'loss': 9.3335171, 'batch': 38, 'size': 17}
39/50 [======================>.......] - ETA: 9s - loss: 9.3965 - acc: 0.1048 {'acc': nan, 'loss': nan, 'batch': 39, 'size': 0}
40/50 [=======================>......] - ETA: 8s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 40, 'size': 0}
41/50 [=======================>......] - ETA: 7s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 41, 'size': 0}
42/50 [========================>.....] - ETA: 6s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 42, 'size': 0}
43/50 [========================>.....] - ETA: 5s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 43, 'size': 0}
44/50 [=========================>....] - ETA: 4s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 44, 'size': 0}
45/50 [==========================>...] - ETA: 3s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 45, 'size': 0}

I changed the regularisation and customised the loss function as follows:

def constrainedCrossEntropy(x, y):
    x = T.clip(x, 0.0001, 0.99999)
    return losses.categorical_crossentropy(x, y)

# Model
l_conv1 = Conv1D(filters, filter_length=filter_sizes[0], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001),
                 input_shape=(seq_leng, VOCAB_SIZE))(inputs)
l_pool1 = MaxPooling1D(pool_size=pooling_size, padding='same')(l_conv1)
l_conv2 = Conv1D(filters, filter_length=filter_sizes[0], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_pool1)
l_pool2 = MaxPooling1D(pool_size=pooling_size, padding='same')(l_conv2)

l_conv3 = Conv1D(filters, filter_length=filter_sizes[1], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_pool2)
l_conv4 = Conv1D(filters, filter_length=filter_sizes[1], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_conv3)

Please advise me on what to do. Thanks
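
For reference, Keras calls a custom loss as loss(y_true, y_pred), so the clip is normally applied to the predictions (the second argument) rather than to the first one as in the snippet above. A minimal sketch of that variant, reusing the clip bounds from the comment (the function name is just illustrative):

# Clipped categorical crossentropy in Keras 2's (y_true, y_pred) convention;
# clipping the predictions keeps log(0) from ever being evaluated.
from keras import backend as K
from keras import losses

def constrained_crossentropy(y_true, y_pred):
    y_pred = K.clip(y_pred, 0.0001, 0.99999)
    return losses.categorical_crossentropy(y_true, y_pred)

# model.compile(loss=constrained_crossentropy, optimizer='sgd')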

@MaratZakirov

MaratZakirov commented Jun 26, 2017

@hr0nix As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.

Just now I had a problem with the Keras CTC loss on top of a softmax; I added one more tanh layer before the softmax and the NaNs are gone!
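
A minimal sketch of that workaround for a plain classifier; the layer sizes and input shape are placeholders, not values from the comment above:

# Bound the activations feeding the softmax with tanh instead of relu,
# so the logits cannot grow arbitrarily large.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(100,)))
model.add(Dense(128, activation='tanh'))  # bounded in [-1, 1], unlike relu
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')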

@Phylliida

Phylliida commented Jun 26, 2017

@varshini24

varshini24 commented Jul 13, 2017

I also faced the same issue, with the loss turning into 'nan' as the network got deeper.

But I solved the problem by decaying the learning rate every epoch.
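
A minimal sketch of per-epoch learning-rate decay with Keras' LearningRateScheduler callback; the initial rate of 0.01 and the 0.95 decay factor are placeholders, not values from the comment above:

from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    # exponential decay: lr = 0.01 * 0.95^epoch
    return 0.01 * (0.95 ** epoch)

lr_schedule = LearningRateScheduler(step_decay)
# model.fit(x_train, y_train, epochs=50, callbacks=[lr_schedule])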

@brunez

brunez commented Jul 27, 2017

I haven't looked deeply into it, but I think this might have to do with the presence of zeros at some point.

The reason I suspect this is a workaround I found, which seems pretty robust so far: I just added a layer with very small Gaussian noise after each of my layers. NaNs no more.
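
A minimal sketch of that workaround with Keras' GaussianNoise layer; the 1e-4 standard deviation and the layer sizes are placeholders:

from keras.models import Sequential
from keras.layers import Dense, GaussianNoise

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(100,)))
model.add(GaussianNoise(1e-4))  # tiny noise, only active at training time, keeps activations off exact zero
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')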

@sun-peach

sun-peach commented Aug 7, 2017

I also had this problem recently. I have tried loss clipping, weight constraints, and adding a regularizer with a small value. None of them works. I am doing a regression problem and use cuDNN with float64. I use Adam (I tried RMSprop; still the same problem).

BTW, I do not control the last layer (the linear layer). Will that be a problem?

@stale

stale bot commented Nov 5, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the stale label Nov 5, 2017

@tallestfinder

tallestfinder commented Dec 4, 2017

Same problem with the TensorFlow CPU and Theano GPU backends. However, the same network seems to work perfectly fine with the tensorflow-gpu backend. Another problem exists with the GPU backend though: it can't save weights at checkpoints, and the entire Python environment crashes out completely.

@stale stale bot removed the stale label Dec 4, 2017

@cop4587

cop4587 commented Jan 12, 2018

I encountered the same problem with the TensorFlow backend on a GTX 980 Ti, with a medium-size image dataset (60, 60, 1). My loss comprises binary_crossentropy plus a KL loss. After applying K.clip to both terms, the nan is gone.
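
A minimal sketch of what this comment describes, assuming a VAE-style setup where the KL term comes from the encoder's z_mean and z_log_var tensors; the clip bounds are placeholders, and this is not the poster's actual code:

from keras import backend as K
from keras import losses

def make_clipped_loss(z_mean, z_log_var):
    def loss_fn(y_true, y_pred):
        recon = losses.binary_crossentropy(y_true, y_pred)
        kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        # clip both terms so one exploding term cannot turn the sum into inf/nan
        return K.clip(recon, 0.0, 1e3) + K.clip(kl, 0.0, 1e3)
    return loss_fn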

lagerros added a commit to lagerros/Alpha-Santorini that referenced this issue Jan 21, 2018

Capped relu at 6
Trying to solve why NaN losses occur when running on Azure GPUs (same issue as here: keras-team/keras#1244)...
@coleclayman

coleclayman commented Mar 8, 2018

I tried every suggestion on this page and many others to no avail. We were importing CSV files with pandas, then using the Keras Tokenizer with text input to create vocabularies and word-vector matrices. After noticing that some CSV files led to nan while others worked, we looked at the encoding of the files and realized that ASCII files were NOT working with Keras, leading to nan loss and an accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.

If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (Linux) or file -I {input} (macOS) to discover your file's encoding. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. I haven't tried the latter, but I'd imagine it would work as well. Hopefully this helps someone very, very frustrated!
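
A minimal sketch of re-encoding such a file to utf-8 before feeding it to the Tokenizer; 'data.csv' and the iso-8859-1 source encoding are placeholders for your own file:

import codecs

with codecs.open('data.csv', 'r', encoding='iso-8859-1') as src:
    text = src.read()
with codecs.open('data_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)

# or, when loading with pandas:
# df = pd.read_csv('data.csv', encoding='iso-8859-1')
# df.to_csv('data_utf8.csv', encoding='utf-8', index=False)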

@jdhurwitz

jdhurwitz commented Mar 21, 2018

Training on CPU, and I have tried everything on this page. The inputs are fine; I used np.nan_to_num on literally every input...
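
One check that complements np.nan_to_num is to verify that the targets, not just the inputs, are finite; a minimal sketch (x_train and y_train are assumed names, not from the comment above):

import numpy as np

def assert_finite(name, arr):
    bad = ~np.isfinite(arr)
    if bad.any():
        raise ValueError('%s contains %d non-finite values' % (name, bad.sum()))

# assert_finite('x_train', x_train)
# assert_finite('y_train', y_train)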

@colpain

colpain commented Apr 6, 2018

There are multiple causes of this problem. What happened to me is that my output dimension didn't match the label indices in sparse_categorical_crossentropy; when I run it on GPU it doesn't throw an error, whereas running on CPU it throws an 'output dimension doesn't match' error.
That caused the GPU nan loss in my case.
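
A minimal sketch of a check that catches this mismatch before training; model and y_train are assumed names, and num_classes is read from the softmax output layer:

import numpy as np

num_classes = model.output_shape[-1]
labels = np.asarray(y_train)
assert labels.min() >= 0 and labels.max() < num_classes, \
    'labels must lie in [0, %d) for sparse_categorical_crossentropy' % num_classes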

@HuangBo-Terraloupe

HuangBo-Terraloupe commented Apr 13, 2018

Hello guys,
I got nan loss when training 1D and 2D LSTMs.
I already tried reducing the learning rate, adding regularization, and making sure there is no NaN in the training and validation sets, but the loss is always NaN.

loss function kld

def kl_divergence(labels, prediction, epsilon=1e-7):
    prediction /= (tf.reduce_sum(prediction, axis=1, keep_dims=True) + epsilon)
    labels /= (tf.reduce_sum(labels, axis=1, keep_dims=True) + epsilon)
    result = tf.reduce_mean(tf.reduce_sum(labels * tf.log((labels / (prediction + epsilon)) + epsilon), axis=1))
    return result

2D lstm model

def lstm_model_2D(time_step, input_size, output_size):
    model = Sequential()
    model.add(ConvLSTM2D(8, (3, 3), strides=(1, 1), padding='same', data_format=None, dilation_rate=(1, 1),
                         activation='relu', recurrent_activation='hard_sigmoid', use_bias=True,
                         kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                         bias_initializer='zeros',
                         unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None,
                         bias_regularizer=None,
                         activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
                         bias_constraint=None,
                         return_sequences=True, go_backwards=False, stateful=False, dropout=0.0, recurrent_dropout=0.0,
                         input_shape=(time_step, ) + input_size))

    model.add(ConvLSTM2D(4, (3, 3), strides=(1, 1), padding='same', data_format=None, dilation_rate=(1, 1),
                         activation='relu', recurrent_activation='hard_sigmoid', use_bias=True,
                         kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                         bias_initializer='zeros',
                         unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None,
                         bias_regularizer=None,
                         activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
                         bias_constraint=None,
                         return_sequences=True, go_backwards=False, stateful=False, dropout=0.0, recurrent_dropout=0.0))

    model.add(ConvLSTM2D(1, (3, 3), strides=(1, 1), padding='same', data_format=None, dilation_rate=(1, 1),
                         activation='relu', recurrent_activation='hard_sigmoid', use_bias=True,
                         kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                         bias_initializer='zeros',
                         unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None,
                         bias_regularizer=None,
                         activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
                         bias_constraint=None,
                         return_sequences=True, go_backwards=False, stateful=False, dropout=0.0, recurrent_dropout=0.0))
    model.add(TimeDistributed(Flatten()))
    # add output layer
    model.add(TimeDistributed(Dense(output_size)))
    return model

1D lstm model

def lstm_model_1D(time_step, input_size, output_size):
    model = Sequential()
    model.add(TimeDistributed(Conv2D(8, (3, 3), activation='relu', padding='same', name='conv1'),
                              input_shape=(time_step, input_size[0], input_size[1], input_size[2])))
    model.add(TimeDistributed(Conv2D(1, (3, 3), activation='relu', padding='same', name='conv2')))  # renamed to avoid a duplicate layer name
    model.add(TimeDistributed(Flatten()))
    # build a LSTM RNN
    model.add(LSTM(
        batch_input_shape=(None, time_step, input_size[0] * input_size[1]),  # Or: input_dim=INPUT_SIZE, input_length=TIME_STEPS,
        output_dim=256,
        return_sequences=True,  # True: output at all steps. False: output at last step.
        activation='relu',
        bias_initializer='RandomUniform',
        dropout=0.5
    ))
    model.add(LSTM(
        batch_input_shape=(None, time_step, input_size[0] * input_size[1]),  # Or: input_dim=INPUT_SIZE, input_length=TIME_STEPS,
        output_dim=256,
        return_sequences=True,  # True: output at all steps. False: output at last step.
        activation='relu',
        bias_initializer='RandomUniform',
        dropout=0.5
    ))
    # add output layer
    model.add(TimeDistributed(Dense(output_size)))
    return model
@colpain

colpain commented Apr 13, 2018

@swang423

swang423 commented Jul 19, 2018

If you have custom layers, check for possible over/underflow with exp/log. For example, I modified Softmax to include a temperature term, but forgot to subtract max off first for numerical stability.
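
A minimal sketch of the max-subtraction trick for a temperature softmax; the temperature parameter and function name are illustrative, not the poster's code:

from keras import backend as K

def softmax_with_temperature(logits, temperature=1.0, axis=-1):
    z = logits / temperature
    z = z - K.max(z, axis=axis, keepdims=True)  # largest logit becomes 0, so exp() cannot overflow
    e = K.exp(z)
    return e / K.sum(e, axis=axis, keepdims=True)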

@cchung100m

cchung100m commented Jul 30, 2018

Dear all,

I encountered the same problem on my laptop, a MacBook Pro (Retina, 15-inch, Late 2013) with an Intel Iris Pro 1536 MB graphics card! The problem was solved with the following GPU configuration:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

My simple Keras NN code:

def basic_model_1(x_size, y_size):
    t_model = Sequential()
    t_model.add(Dense(100, activation="tanh", input_shape=(x_size,)))
    t_model.add(Dense(50, activation="relu"))
    t_model.add(Dense(y_size))
    print(t_model.summary())
    t_model.compile(loss='mean_squared_error',
                    optimizer=Adam(),
                    metrics=[metrics.mae])
    return(t_model)

model = basic_model_1(arr_x_train.shape[1], arr_y_train.shape[1])

%%time
history = model.fit(arr_x_train, arr_y_train,
                    batch_size=128,
                    epochs=500,
                    shuffle=True,
                    verbose=1,
                    validation_data=(arr_x_valid, arr_y_valid),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=20)])
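
Note that the config snippet above only changes how GPU memory is allocated; if the goal is really to run on the CPU alone, one way (an assumption, not taken from this comment) is to hide the GPU from TensorFlow before the session is created:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # must be set before TensorFlow initialises any device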

@veqtor

veqtor commented Aug 2, 2018

For future reference:
I had this problem with a WGAN-GP network with some GRU layers; the problem went away after adding clip_norm=1.0 to the optimizer and removing any data containing non-finite numbers from the dataset.
Lesson learned:
Always use clip_norm when you have any recurrent layers.

edit:
Don't set clip_norm to 1!! Set it to some reasonable value like 5-6 if you have this problem, maybe even higher! Clip norm is a threshold where lower means a stronger clipping effect, so setting it as low as 1 has a drastic effect and likely causes very strong vanishing gradients.

I ended up with a clip_norm value of 3.0 that seemed to fix the nan loss without the gradients vanishing completely.
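
In Keras the optimizer keyword is spelled clipnorm (clipvalue also exists); a minimal sketch with the 3.0 value this comment settled on, where the learning rate is a placeholder:

from keras.optimizers import Adam

optimizer = Adam(lr=1e-4, clipnorm=3.0)  # clip the gradient norm at 3.0
# model.compile(loss='binary_crossentropy', optimizer=optimizer)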

@marquettec

marquettec commented Aug 17, 2018

Did you try compiling your model with the 'adam' optimizer?
It worked for me, as explained in this post: https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network
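
A minimal sketch of that suggestion; the loss is a placeholder and model is assumed to exist already:

model.compile(loss='mean_squared_error', optimizer='adam')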
