LSTM with Batch Normalization #2183
Conversation
Is there a way to make a recurrent batch normalization module that can be used by any type of recurrent layer? Something like a model that takes an RNN layer as input? Or is this too layer-specific? Also, how much improvement did you observe in your tests?
I had to dig into LSTM and re-code it a little bit, so it is not possible to do this in a generic way without rewriting the existing Recurrent units (which is not a lot of work). I saw a huge improvement in speed and in the total loss I reached.
That is pretty good! Here is an aside: both dropout and batchnorm should be kept the same throughout the entire run, which makes sense. Dropout is meant to drop weights and not activations. Maybe in the future we should look into the way we write RNNs in Keras to support dropout and batchnorm callbacks in a generalized way. This would make it easy to support more complicated and custom RNN types (like Neural GPUs and conv-RNNs) without having to rewrite them. Any ideas on how to implement RNN callbacks, @fchollet, @farizrahman4u? Can Keras-1 help with that?
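Purely to illustrate the kind of generalized hook being floated in the comment above (nothing like this existed in Keras at the time; `StepHook`, `BatchNormHook`, and the call signature are all hypothetical):

```python
import numpy as np

class StepHook(object):
    """Hypothetical callback applied to each recurrent step's pre-activation."""
    def __call__(self, z, step_index):
        return z  # identity by default

class BatchNormHook(StepHook):
    """Hypothetical hook that batch-normalizes the pre-activation at each step."""
    def __init__(self, gamma, beta, eps=1e-5):
        self.gamma, self.beta, self.eps = gamma, beta, eps

    def __call__(self, z, step_index):
        mean, var = z.mean(axis=0), z.var(axis=0)  # statistics over the batch axis
        return self.gamma * (z - mean) / np.sqrt(var + self.eps) + self.beta
```

An RNN written against such a hook could support per-step dropout, BN, or anything else without its step logic being rewritten.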
Actually, the paper Recurrent Batch Normalization reports better results when separate BN statistics are used at each step of the LSTM iteration. That would have complicated the code a lot and added a lot of memory consumption, and the paper author's implementation that I found did not do it either.
For now I have only tested on Theano, which takes ages to compile. Surprisingly, TensorFlow compilation is also slow...
It's great to have this implementation available, but at the same time I don't think we should have more than one implementation of LSTM in Keras. Either we should figure out how to make BatchNorm part of the current implementation, or we should make the BN LSTM implementation part of a Keras extensions package (like @EderSantana's Seya).
OK, I will add it as a parameter to the existing LSTM.
I merged the BN code into the LSTM class. I removed some of the usual BN parameters from the interface because I believe there is no need to touch them, and they would have made the LSTM API even more complex. This PR also fixes a bug that was hidden in several places in the code, in which …
In any case, we won't merge anything into Keras 0; the current version is considered final. All new PRs should be made against Keras-1, because that branch will replace master rather than be merged into master.
OK, it's too much for me (and my GPU) to switch between two versions of Keras, so once I personally switch to Keras-1 I will make a new PR. We can keep this PR as-is for anyone who wants it.
(force-pushed from 9855d8a to dbda22d)
Cool, you can just adapt it to Keras 1 (LSTM hasn't changed much) and resubmit it after Keras 1 is officially released.
Primary author here, just wanted to confirm that we do use the same gamma/beta on all steps. Only the statistics can differ over time, not the parameters. The code looks right to me, though I don't understand how/where Keras manages the updates. Thanks for making this available!
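In other words, here is a minimal NumPy sketch of the scheme Tim describes (this is not the PR's code): a single gamma/beta pair is learned and reused at every timestep, while the mean and variance are estimated separately per timestep.

```python
import numpy as np

def recurrent_batchnorm(x_seq, gamma, beta, eps=1e-5):
    """Normalize a (timesteps, batch, features) sequence.

    gamma and beta are shared across all timesteps; the mean and
    variance are computed independently for each timestep.
    """
    out = np.empty_like(x_seq)
    for t, x_t in enumerate(x_seq):
        mean, var = x_t.mean(axis=0), x_t.var(axis=0)   # this step's statistics
        out[t] = gamma * (x_t - mean) / np.sqrt(var + eps) + beta  # shared parameters
    return out
```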
Hi Tim, the part that confuses me is that the updates added to …
Looks like I already explained this in the code: …
Any progress?
@udibr I'll try to port your code into Keras 1, if you haven't yet.
@xingdi-eric-yuan I have a working version of LSTM BN on a branch called mymaster: https://github.com/udibr/keras/blob/mymaster/keras/layers/recurrent.py#L583. The version in the original PR (for Keras 0.3) is incorrect. As @cooijmanstim hinted, we are updating the mean/var at every step of the LSTM, so for a generic solution you should somehow allow the step function to accept the mean/var as input to each step and emit the updated values as output of each step. My new code on the mymaster branch does this, but it is very convoluted code which I am not too proud of, and as it stands I think it really should live outside of Keras (see what @fchollet said about Seya), or at least in a separate class from LSTM. Please update me if you look into it and manage to clean it up, or of course find bugs. I played with it a little on real examples, and it looks like the LSTM+BN mechanism slows run time and, on the other hand, causes overfitting. So it is not just a matter of turning the batch_norm flag to True; you need to change the model size to get comparable results. Maybe you can end up with a smaller/faster model, but I'm not sure.
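A rough sketch of the threading described in the comment above, with an entirely hypothetical state layout (this is not the code on the mymaster branch): the running statistics ride along in the recurrent state, each step folds in its own batch statistics, and the updated estimates are handed to the next step.

```python
import numpy as np

def bn_step(x_t, states, gamma, beta, momentum=0.9, eps=1e-5):
    """One recurrence step; states = (h, running_mean, running_var)."""
    h, running_mean, running_var = states
    mean, var = x_t.mean(axis=0), x_t.var(axis=0)       # this batch's statistics
    x_hat = gamma * (x_t - mean) / np.sqrt(var + eps) + beta
    h_new = np.tanh(x_hat + h)                          # stand-in for the real LSTM body
    # Fold this step's statistics into the running estimates and return
    # them as part of the state, so the scan threads them through.
    new_mean = momentum * running_mean + (1 - momentum) * mean
    new_var = momentum * running_var + (1 - momentum) * var
    return h_new, (h_new, new_mean, new_var)
```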
@udibr Thanks for pointing it out, I'll look into it 👍
I'll make a new pull request and paste some of the test results on it.
So what are we going to do with this PR, and with BN for RNNs in general?
Yes, maybe. I see quite a bit of interest in this feature coming from advanced users. But do the results delivered justify the interest? For users, is the runtime performance cost justified by superior learning performance? And for us developers, is the added codebase complexity and maintenance cost justified by the usefulness of the feature?
I now think it should live outside Keras. I also tried to merge my working version (on the mymaster branch) into this pull request's branch so that other users will not accidentally start working from the wrong version, but there have been a few changes to the master LSTM in the meantime, so some real merge work is needed now. Also, as @xingdi-eric-yuan found out, some work is needed to make it work on TF.
@fchollet -- what about a second repo, one where things are a little more 'wild west', but where solutions like this are integrated? I see many comments around the Slack/Google group/GitHub issues that indicate people are reinventing many wheels with Keras. For example, I have a decent soft attention mechanism, and I've seen at least three separate people implement their own. And it wouldn't be without precedent: Lasagne has Recipes, Blocks has blocks-extras, and Caffe has a variety, such as nlpcaffe. I'd be willing at least to assist with PRs, provided that I didn't have to maintain API consistency the way you are (excellently) doing with Keras. Given wild-west status, I think it could be nice.
To avoid future confusion I moved my code back to the lstmbn branch of my fork, which is the branch used in this PR. HOWEVER, I did not run any tests on …
@braingineer the problem with a second repo is keeping it maintained. There were several attempts to create such repos …
@udibr that makes sense. I think if they were the combined personal libraries of some of the more prolific Keras users, they would have a higher chance of being maintained. Plus, wild-west status means not having to guarantee everything works; it could come with the academic guarantee, so to speak ('it worked for me once, but I make no promises after that'). Even having it all in a central place, without compatibility with some new version, would be useful; someone could come along and decide to port it, as a way to contribute. I think, though, that minimally there should at least be an index of personal Keras libraries, and it should be pointed at by Keras master (by a link or, more generously, a page in the docs). Right now, there is a loose set of repositories and gists that you have to know about in order to find (for example, Seya, seqtoseq, some VGG ports, a Keras branch that converts Caffe models to Keras models, attention mechanisms, etc.).
@braingineer How about this wiki page: https://github.com/fchollet/keras/wiki/Built-with-Keras
That's awesome, though I would consider it the minimum level. Also, I did not realize there was a wiki, so it could use some more signposting; it's not mentioned anywhere (I don't think). I feel that something like the awesome series would be a better minimum (e.g. https://github.com/kjw0612/awesome-rnn), and it should/could have a blurb about it (a natural place would be near the mention of the examples page, where it says …
@barvinograd @udibr Hi, does this work now? Thanks!
Has this been updated for Keras 1.x? It'd be very useful for what I'm working on right now.
Has anyone tried this with sampling? I was unable to make this work in sampling mode. Any pointers?
@fchollet @braingineer @udibr @xingdi-eric-yuan What was the group decision on whether BN for RNN belongs in Keras?
@udibr Why was your final conclusion that your LSTMBN class should not be merged into Keras? Further questions: …
Thanks, -- Freddy Snijder
@udibr Have you tried this method? I have run some experiments: using population statistics (training: False) at test time gives worse results than batch statistics, and I don't know why.
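For context, the distinction being discussed, as a minimal sketch (names are illustrative, not any particular library's API): training mode normalizes with the current batch's statistics, while test mode substitutes the accumulated population estimates.

```python
import numpy as np

def batchnorm(x, gamma, beta, pop_mean, pop_var, training, eps=1e-5):
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)  # this batch's statistics
    else:
        mean, var = pop_mean, pop_var              # accumulated population statistics
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```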
Simplified LSTM with Batch Normalization from the paper Recurrent Batch Normalization.
The main simplification is that the same gamma is used on all steps.
This PR is for Keras-0. I will merge it into Keras-1 once it's out of preview.
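For reference, the batch-normalized LSTM from the paper as I read it (the PR's simplification is that the same gamma is reused on every step, with statistics handled as discussed above):

```latex
\mathrm{BN}(h;\gamma,\beta) = \beta + \gamma \odot
    \frac{h - \widehat{\mathbb{E}}[h]}{\sqrt{\widehat{\mathrm{Var}}[h] + \epsilon}}

\begin{pmatrix} \tilde{f}_t \\ \tilde{i}_t \\ \tilde{o}_t \\ \tilde{g}_t \end{pmatrix}
    = \mathrm{BN}(W_h h_{t-1}; \gamma_h, 0) + \mathrm{BN}(W_x x_t; \gamma_x, 0) + b

c_t = \sigma(\tilde{f}_t) \odot c_{t-1} + \sigma(\tilde{i}_t) \odot \tanh(\tilde{g}_t)

h_t = \sigma(\tilde{o}_t) \odot \tanh(\mathrm{BN}(c_t; \gamma_c, \beta_c))
```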