
Cost is NaN after one epoch if maxlen > 50 #60

Closed
ethancaballero opened this issue Sep 25, 2016 · 3 comments


ethancaballero commented Sep 25, 2016

For some reason, the cost becomes NaN after one epoch if I increase the 'maxlen' parameter to any number greater than 50.

➜  session2 git:(master) ✗ THEANO_FLAGS=floatX=float32 python train_nmt.py
WARNING (theano.configdefaults): Only clang++ is supported. With g++, we end up with strange g++/OSX bugs.
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
{'use-dropout': [True], 'dim': [1024], 'optimizer': ['adadelta'], 'dim_word': [512], 'reload': [True], 'clip-c': [1.0], 'n-words': [30000], 'model': ['model_hal.npz'], 'learning-rate': [0.0001], 'decay-c': [0.0]}
Reloading model options
Loading data
Building model
Reloading model parameters
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Building f_cost... Done
Computing gradient... Done
Building optimizers... Done
Optimization
Seen 5 samples
NaN detected

setup_local_env.sh was used for setup

orhanf (Collaborator) commented Sep 25, 2016

Hi @ethancaballero, I guess you are already applying gradient clipping, since clip-c is 1.0 in your config.
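For reference, clip-c style gradient clipping rescales all gradients by a common factor whenever their global L2 norm exceeds the threshold. A minimal NumPy sketch of the idea (the helper name is illustrative, not from this code base):

```python
import numpy as np

def clip_by_global_norm(grads, clip_c):
    # Compute the global L2 norm across all gradient arrays.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm exceeds the threshold, rescale every gradient
    # by the same factor, so the clipped global norm equals clip_c
    # and the direction of the update is preserved.
    if global_norm > clip_c:
        grads = [g * (clip_c / global_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]            # global norm = 5.0
clipped = clip_by_global_norm(grads, 1.0)  # rescaled to norm 1.0
```

Note that clipping only bounds the size of each update; it does not prevent NaNs that originate in the forward pass (e.g. log(0) or an overflowing exp), which is why the pointers below are still worth checking.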

I can suggest a few pointers to start with:

  • Please change the optimizer and start with a very small learning rate (maybe Adam with 1e-4) to rule out optimizer-related issues.
  • If that does not solve the problem, check the output probabilities and clip them to the range [0.0+eps, 1.0-eps]; the model may be producing very peaky distributions.
  • Check any exp operation that is performed and make sure it is stabilized and not causing an overflow.
  • You may also consider playing with the truncate_gradient parameter of scan, to give us some more clues.
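The probability-clipping and exp-stabilization points can be sketched in NumPy (the function names and eps value are illustrative, not from this code base):

```python
import numpy as np

def stable_softmax(logits):
    # Subtract the row max before exponentiating, so exp() never
    # overflows even for very large logits; the softmax output is
    # mathematically unchanged by this shift.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def safe_log_probs(probs, eps=1e-7):
    # Clip probabilities away from exactly 0 and 1, so that log(p)
    # cannot produce -inf and poison the cost (and gradients) with NaN.
    return np.log(np.clip(probs, eps, 1.0 - eps))

# A very peaky distribution: a naive exp() here would overflow.
logits = np.array([[1000.0, 0.0, -1000.0]])
p = stable_softmax(logits)
logp = safe_log_probs(p)
```

Longer sequences (larger maxlen) accumulate more terms in the cost, which makes peaky distributions and unstabilized exp calls more likely to surface as NaN.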

By the way, does this always happen right after the first epoch, even if you change the random seed and the minibatch ordering?

@0bserver07

Hey @orhanf, we managed to replicate this error and fix it with @ethancaballero.
For anyone who sees this error in the future and can't get it resolved with your suggestions, here is what it was in our case:

On Mac OS X El Capitan 10.11.2, something with memory + cache was causing the NaN to be output right after the first epoch, or even before the first epoch started.

It was resolved after updating the OS.

A few things to keep in mind, though: the original code base has a path hard-coded in an odd style, right about here. For future reference, anyone who sees this issue should check this other issue out and follow something along the lines of that.

@orhanf
Copy link
Collaborator

orhanf commented Oct 6, 2016

Thanks @ethancaballero and @0bserver07

@orhanf closed this as completed Oct 6, 2016
@amirj mentioned this issue Oct 13, 2016