
Cost is NaN after one epoch if maxlen > 50 #60

Closed
ethancaballero opened this issue Sep 25, 2016 · 3 comments


ethancaballero commented Sep 25, 2016

For some reason, the cost becomes NaN after one epoch if I increase the 'maxlen' parameter to any number greater than 50.

➜  session2 git:(master) ✗ THEANO_FLAGS=floatX=float32 python train_nmt.py
WARNING (theano.configdefaults): Only clang++ is supported. With g++, we end up with strange g++/OSX bugs.
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
{'use-dropout': [True], 'dim': [1024], 'optimizer': ['adadelta'], 'dim_word': [512], 'reload': [True], 'clip-c': [1.0], 'n-words': [30000], 'model': ['model_hal.npz'], 'learning-rate': [0.0001], 'decay-c': [0.0]}
Reloading model options
Loading data
Building model
Reloading model parameters
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Building f_cost... Done
Computing gradient... Done
Building optimizers... Done
Optimization
Seen 5 samples
NaN detected

setup_local_env.sh was used for setup

orhanf (Collaborator) commented Sep 25, 2016

Hi @ethancaballero, I guess you are already applying gradient clipping, since clip-c is 1.0 in your config.
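For reference, clip-c style gradient clipping rescales all gradients by a common factor whenever their global L2 norm exceeds the threshold. A minimal NumPy sketch of the idea (the helper name is illustrative, not from this code base):

```python
import numpy as np

def clip_by_global_norm(grads, clip_c):
    # Compute the global L2 norm across all gradient arrays.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm exceeds the threshold, rescale every gradient
    # by the same factor, so the clipped global norm equals clip_c
    # and the direction of the update is preserved.
    if global_norm > clip_c:
        grads = [g * (clip_c / global_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]            # global norm = 5.0
clipped = clip_by_global_norm(grads, 1.0)  # rescaled to norm 1.0
```

Note that clipping only bounds the size of each update; it does not prevent NaNs that originate in the forward pass (e.g. log(0) or an overflowing exp), which is why the pointers below are still worth checking.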

I can suggest a few pointers to start with:

  • Please change the optimizer and start with a very small learning rate (maybe Adam with 1e-4) to rule out optimizer-related issues.
  • If that does not solve the problem, check the output probabilities and clip them to the range [0.0+eps, 1.0-eps]; the model may be producing very peaky distributions.
  • Check any exp operation that is performed and make sure it is stabilized and not causing an overflow.
  • You may also consider playing with the truncate_gradient parameter of scan, to give us some more clues.
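The probability-clipping and exp-stabilization points can be sketched in NumPy (the function names and eps value are illustrative, not from this code base):

```python
import numpy as np

def stable_softmax(logits):
    # Subtract the row max before exponentiating, so exp() never
    # overflows even for very large logits; the softmax output is
    # mathematically unchanged by this shift.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def safe_log_probs(probs, eps=1e-7):
    # Clip probabilities away from exactly 0 and 1, so that log(p)
    # cannot produce -inf and poison the cost (and gradients) with NaN.
    return np.log(np.clip(probs, eps, 1.0 - eps))

# A very peaky distribution: a naive exp() here would overflow.
logits = np.array([[1000.0, 0.0, -1000.0]])
p = stable_softmax(logits)
logp = safe_log_probs(p)
```

Longer sequences (larger maxlen) accumulate more terms in the cost, which makes peaky distributions and unstabilized exp calls more likely to surface as NaN.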

By the way, does this always happen right after the first epoch, even if you change the random seed and the minibatch ordering?

@0bserver07

Hey @orhanf, we managed to replicate this error and fix it with @ethancaballero.
For anyone who sees this error in the future and can't get it resolved with your suggestions, here is what it was in our case:

On Mac OS X El Capitan 10.11.2, something with memory + cache was causing the NaN to be output right after the first epoch, or even before the first epoch started.

It was resolved after updating the OS.

A few things to keep in mind, though: the original code base has a path hard-coded in an odd style, right about here. For future reference, anyone who sees this issue should check this other issue out and follow something along the lines of that.

@orhanf
Copy link
Collaborator

orhanf commented Oct 6, 2016

Thanks @ethancaballero and @0bserver07

@orhanf closed this as completed Oct 6, 2016
@amirj mentioned this issue Oct 13, 2016