
Added Recurrent Batch Normalization #163

Open · iassael wants to merge 9 commits into master

Conversation

@iassael commented Apr 16, 2016

Following the implementation described in Recurrent Batch Normalization (http://arxiv.org/abs/1603.09025), this adds Batch-Normalized LSTMs (BN-LSTM).
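
For reference, here is a rough sketch of where the normalization goes in the LSTM cell (nngraph pseudo-code following the paper's formulation rather than the exact diff of this PR; input_size and rnn_size are assumed to be defined as in the stock model):

local x      = nn.Identity()()
local prev_h = nn.Identity()()
-- normalize the input-to-hidden and hidden-to-hidden pre-activations
-- separately, before they are summed (Section 3 of the paper)
local i2h = nn.BatchNormalization(4 * rnn_size)(nn.Linear(input_size, 4 * rnn_size)(x))
local h2h = nn.BatchNormalization(4 * rnn_size)(nn.Linear(rnn_size, 4 * rnn_size, false)(prev_h))
local all_sums = nn.CAddTable()({i2h, h2h})
-- the gates i, f, o, g are sliced out of all_sums exactly as in the plain
-- LSTM; the paper additionally normalizes the cell state before the final
-- tanh when computing h_t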

@karpathy (Owner) commented

Thanks! Curious - have you tested if this works better?

@iassael (Author) commented Apr 16, 2016

I had the same question, and I just deployed it to our servers. I'll come back with more results!
Thank you!

@iassael (Author) commented Apr 16, 2016

Here are the validation scores for LSTM and BN-LSTM using the default options.

BN-LSTM trains faster but without dropout it tends to overfit faster as well.

@windweller commented

Hey @iassael, did you use a different mean/variance for each timestep, or a single mean/variance shared over all timesteps of one batch? The paper says: "Consequently, we recommend using separate statistics for each timestep to preserve information of the initial transient phase in the activations."

@iassael (Author) commented Apr 16, 2016

UPDATE: Check my reply below.

Hi @windweller, you are right. In this case, following the current project structure, the statistics were computed over all timesteps.

@iassael (Author) commented Apr 17, 2016

@windweller, looking at the implementation of nn.BatchNormalization, the running_mean and running_var variables are not part of the parameters vector, as they are not trainable.

Therefore, even when proto.rnn is cloned, each nn.BatchNormalization layer of each clone keeps its own statistics (running_mean and running_var).

Hence, the implementation acts as recommended in the paper.

Thank you for pointing it out!
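
A quick way to check this (a sketch, assuming the stock nn.BatchNormalization API and the cloning utility used in this repo):

local nn = require 'nn'
local bn = nn.BatchNormalization(4 * 128)
local params, gradParams = bn:parameters()
print(#params)                -- 2: only the affine weight and bias
print(bn.running_mean ~= nil) -- true: the running statistics are plain module
                              -- fields, outside the parameter vector
-- model_utils.clone_many_times ties the parameters of the clones together,
-- so weight/bias are shared while each clone keeps its own
-- running_mean/running_var buffers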

@fmassa commented Apr 17, 2016

Quick note: there is no need to implement LinearNB, as the no-bias functionality was integrated in nn already torch/nn#583
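
For example (assuming a version of torch/nn that includes that change), the third constructor argument disables the bias directly:

local nn = require 'nn'
local h2h = nn.Linear(128, 4 * 128, false) -- no bias; replaces the custom LinearNB
print(h2h.bias)                            -- nil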

@karpathy (Owner) commented Apr 19, 2016

Can I ask what the motivation is for removing biases from that linear layer? (haven't read the BN LSTM papers yet). Is this just to avoid redundancy? Also, is it a big deal if this wasn't done? Also, is this code fully backwards compatible and identical in functionality? And how would the code behave if someone has an older version of torch that does not have the LinearNB patch?

EDIT: e.g. it seems to me that, due to the additional "false" argument in one of the nn.Linear calls, this code is not backwards compatible and does not behave identically. Although I think it should be fine because the xtoh pathway already has biases?

@iassael (Author) commented Apr 19, 2016

Hi @karpathy, the motivation is exactly to avoid redundancy. This saves 2*rnn_size parameters; with the default settings that is 256 of the model's 239,297 parameters (~0.1%), which is not significant and could safely be ignored.

In terms of backward compatibility, an extra argument passed to a Lua function is simply discarded. So with an older version of torch/nn the additional "false" is ignored and the layer is created with a bias: the behavior differs slightly, but the code runs in both cases.

A simple example is the following:

function test(a, b) print(a, b) end
test(1, 2, 3)   -- the extra argument 3 is silently discarded
> 1   2
