
LSTM with Batch Normalization #2183

Closed
wants to merge 9 commits into from

Conversation

udibr
Contributor

@udibr udibr commented Apr 4, 2016

Simplified LSTM with Batch Normalization from the paper Recurrent Batch Normalization.
The main simplification is that the same gamma is used on all steps.

This PR is for Keras-0. I will merge it into Keras-1 once it's out of preview.
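
For readers who want the gist without opening the diff, here is a minimal NumPy sketch of the simplified recurrence. This is an illustration, not the PR's actual code: names are made up, and the cell-state normalization and beta offsets from the paper are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(x, gamma, eps=1e-5):
    # Normalize over the batch axis; gamma is the *same* at every time step,
    # only the batch statistics change from step to step.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps)

def bn_lstm_step(x_t, h_prev, c_prev, W, U, b, gamma_x, gamma_h):
    # Input-to-hidden and hidden-to-hidden transformations are normalized
    # separately before being combined, as in Recurrent Batch Normalization.
    z = batch_norm(x_t.dot(W), gamma_x) + batch_norm(h_prev.dot(U), gamma_h) + b
    i, f, o, g = np.split(z, 4, axis=-1)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t
```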

@EderSantana
Contributor

Is there a way to make a recurrent batch normalization module that can be used by any type of recurrent layer? Something like a model that gets an RNN layer as input? Or is this too layer-specific?

Also, how much improvement did you observe in your tests?

@udibr
Contributor Author

udibr commented Apr 4, 2016

I had to dig into LSTM and re-code it a little bit, so it is not possible to do this in a generic way without rewriting the existing Recurrent units (which is not a lot of work).

I saw a huge improvement in speed and in the total loss I reached.
Doing more tests...

@EderSantana
Contributor

That is pretty good!!!


Here is an aside: both dropout and batchnorm should be kept the same throughout the entire run; that makes sense. Dropout is meant to drop weights and not activations.

Maybe in the future we should look into the way we write RNNs in Keras to support dropout and batchnorm callbacks in a generalized way. This will make it easy to support more complicated and custom RNN types (like neural-gpu and conv-rnns) without having to rewrite them.

Any ideas on how to implement RNN callbacks, @fchollet, @farizrahman4u? Can Keras-1 help with that?

@udibr
Contributor Author

udibr commented Apr 4, 2016

Actually, the paper Recurrent Batch Normalization says they got better results when different BN parameters are used at each step of the LSTM iteration. This would have complicated the code a lot and added a lot of memory consumption, and the implementation by the paper's author that I found did not do it either:
https://github.com/cooijmanstim/Attentive_reader/blob/bn/codes/att_reader/layers.py#L33

@udibr
Contributor Author

udibr commented Apr 4, 2016

For now I have only tested on Theano... which takes ages to compile.

Surprisingly, TensorFlow compilation is also slow...

@fchollet
Member

fchollet commented Apr 4, 2016

It's great to have this implementation available, but at the same time I don't think we should have more than one implementation of LSTM in Keras. Either we should figure out how to make BatchNorm part of the current implementation, or we should make the BN LSTM implementation part of a Keras extensions package (like @EderSantana's Seya).

@udibr
Contributor Author

udibr commented Apr 4, 2016

OK, I will add it as a parameter to the existing LSTM.

@udibr
Contributor Author

udibr commented Apr 4, 2016

I merged the BN code into the LSTM class. I removed some of the usual BN parameters from the interface because I believe there is no need to touch them, and they would have made the LSTM API even more complex.

This PR also fixes a bug that was hidden in several places in the code, in which nb_param in the set_weights method was computed directly (and wrongly) instead of using len(self.get_weights()) or something similar. This bug showed up only when you start to merge models that have non_trainable_weights, such as BatchNormalization and now LSTM, and I had to fix it to complete my tests.
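
The gist of that fix, as a sketch (not the exact diff): derive the expected number of weight arrays from the layer itself, so non-trainable arrays are counted too.

```python
# Inside set_weights(self, weights) -- illustration only, not the PR's code.
nb_param = len(self.get_weights())  # counts trainable *and* non-trainable arrays
if len(weights) != nb_param:
    raise Exception('You called set_weights with %d arrays, '
                    'but the layer expects %d.' % (len(weights), nb_param))
```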

@fchollet
Member

fchollet commented Apr 4, 2016

In any case, we won't merge anything into Keras 0; the current version is considered final. All new PRs should be made to Keras-1, because that branch will replace master rather than be merged into master.

@udibr
Contributor Author

udibr commented Apr 5, 2016

OK, it's too much for me (and my GPU) to switch between two versions of Keras, so once I personally switch to Keras-1 I will also make a new PR. We can keep this PR as-is for anyone who wants it.

@udibr udibr force-pushed the lstmbn branch 2 times, most recently from 9855d8a to dbda22d on April 5, 2016 at 00:39
@fchollet
Member

fchollet commented Apr 5, 2016

Cool, you can just adapt it to Keras 1 (LSTM hasn't changed much) and resubmit it after Keras 1 is officially released.

@cooijmanstim

Primary author here, just wanted to confirm that we do use the same gamma/beta on all steps. Only the statistics can differ over time, not the parameters.

The code looks right to me, though I don't understand how/where Keras manages the updates. Thanks for making this available!

@udibr
Contributor Author

udibr commented Apr 6, 2016

Hi Tim,
If you mean how gamma is being updated: this is eventually done through Theano's mechanism of updating shared variables. http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables

@cooijmanstim

The part that confuses me is that the updates added to self.updates in the bn method use variables from the inner graph of the scan op. Somewhere along the line these have to be translated into the corresponding outer variables, but I don't see where this is done.

@udibr
Contributor Author

udibr commented Apr 6, 2016

Looks like I already explained this in the code:
https://github.com/udibr/keras/blob/lstmbn/keras/layers/recurrent.py#L171-L175
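
For readers without the linked code handy, here is a generic Theano sketch (not this PR's code) of the mechanism in question: updates returned from inside a scan step are collected by scan and handed back in terms of the outer shared variables, so they can be passed straight to theano.function. The variable names are illustrative only.

```python
from collections import OrderedDict

import numpy as np
import theano
import theano.tensor as T

running_mean = theano.shared(
    np.zeros(4, dtype=theano.config.floatX), name='running_mean')

def step(x_t):
    batch_mean = x_t.mean(axis=0)
    new_mean = 0.9 * running_mean + 0.1 * batch_mean
    # Returning an updates dict from the inner function lets scan translate
    # the inner-graph expression into an update of the outer shared variable.
    return batch_mean, OrderedDict([(running_mean, new_mean)])

X = T.tensor3('X')  # (time, batch, features)
batch_means, scan_updates = theano.scan(step, sequences=X)
f = theano.function([X], batch_means, updates=scan_updates)
```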

@henry0312
Contributor

Any progress?

@xingdi-eric-yuan
Contributor

@udibr I'll try to port your code into Keras 1, if you haven't yet.

@udibr
Contributor Author

udibr commented May 6, 2016

@xingdi-eric-yuan I have a working version of LSTM BN on a branch called mymaster https://github.com/udibr/keras/blob/mymaster/keras/layers/recurrent.py#L583

The version in the original PR (for Keras 0.3) is incorrect. As @cooijmanstim hinted, we are updating the mean/var at every step of the LSTM, so if you want a generic solution you should somehow allow the step function to accept the mean/var as input at each step and return the updated values as output of each step. My new code on the mymaster branch does this, but it is very convoluted code which I am not too proud of, and as it stands I think it really should be outside of Keras (see what @fchollet said about Seya), or at least in a separate class from LSTM. Please update me if you look into it and manage to clean it up, or of course if you find bugs.
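
To make the "statistics as step inputs/outputs" idea concrete, here is a minimal NumPy illustration (a plain Python loop, not the mymaster code and not the Keras step API): the running mean/variance are carried through the recurrence instead of being updated as a side effect inside the loop. Names and the momentum value are illustrative only.

```python
import numpy as np

def normalize_sequence(X, momentum=0.9, eps=1e-5):
    # X has shape (time, batch, features).
    mean = np.zeros(X.shape[-1])
    var = np.ones(X.shape[-1])
    outputs = []
    for x_t in X:
        batch_mean, batch_var = x_t.mean(axis=0), x_t.var(axis=0)
        # The updated statistics are an *output* of the step and an *input*
        # to the next one, which is what a generic step function would need.
        mean = momentum * mean + (1.0 - momentum) * batch_mean
        var = momentum * var + (1.0 - momentum) * batch_var
        outputs.append((x_t - batch_mean) / np.sqrt(batch_var + eps))
    return np.stack(outputs), mean, var
```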

I played with it a little bit on real examples, and it looks like the LSTM+BN mechanism slows down run time and, on the other hand, causes overfitting. So it is not just a matter of turning the batch_norm flag to True: you need to change the model size to get comparable results. Maybe you can end up with a smaller/faster model, but I'm not sure.

@xingdi-eric-yuan
Contributor

@udibr Thanks for pointing it out, I'll look into it 👍

@xingdi-eric-yuan
Contributor

xingdi-eric-yuan commented May 7, 2016

I'll make a new pull request, and paste some of the test results on it.
#2657

@fchollet
Member

So what are we going to do with this PR, and with BN for RNNs in general?

I think it really should be outside of Keras

Yes, maybe. I see quite a bit of interest for this feature, coming from advanced users. But do the results delivered justify the interest? For users, is the runtime performance cost justified by superior learning performance? And for us developers, is the added codebase complexity and maintenance cost justified by the usefulness of the feature?

@udibr
Contributor Author

udibr commented May 10, 2016

I now think it should be outside.

Also, I tried to merge my working version (on the mymaster branch) into this pull request branch, so that other users would not accidentally start working from the wrong version, but there have been a few changes in the master LSTM, so some real merge work needs to be done now. Also, as @xingdi-eric-yuan found out, some work is needed to make it work on TensorFlow.

@braingineer
Contributor

braingineer commented May 11, 2016

I see quite a bit of interest for this feature, coming from advanced users. But do the results delivered justify the interest? For users, is the runtime performance cost justified by superior learning performance? And for us developers, is the added codebase complexity and maintenance cost justified by the usefulness of the feature?

@fchollet -- what about a second repo, one where things are a little more 'wild west', but solutions like this are integrated? I see many comments around the Slack/Google group/GitHub issues that indicate people are reinventing many wheels with Keras. For example, I have a decent soft attention mechanism, and I've seen at least 3 separate people also implement their own.

And it wouldn't be without precedent. Lasagne has Recipes. Blocks has blocks-extras. Caffe has a variety, such as nlpcaffe. I'd be willing to at least assist with PRs, provided that I didn't have to maintain API consistency the way you are (excellently) doing with Keras. Given wild-west status, I think it might be nice.

@udibr
Contributor Author

udibr commented May 12, 2016

To avoid future confusion I moved my code back to the lstmbn branch of my fork, which was used in this PR.
I also put all my changes in a separate class, LSTMBN, so there will be no conflicts with the Keras master LSTM, which is changing all the time. I also fixed a bug that @xingdi-eric-yuan found (thanks!).

HOWEVER, I did not run any tests on LSTMBN; I closed this PR and I don't think it should be merged into keras/master.

@udibr
Contributor Author

udibr commented May 12, 2016

@braingineer the problem with a keras-extras repository is maintenance. Keras is changing all the time (in a good way), but every time it changes someone needs to go back and make sure that all the add-ons continue to work.

There were several attempts to create a keras-extras, such as Seya and seq2seq, but as far as I know both repos currently do not work because of changes in Keras 1.

@braingineer
Contributor

braingineer commented May 12, 2016

@udibr that makes sense. I think if they were the combined personal libraries of some of the more prolific keras users, it would have a higher chance of being maintained.

Plus, the wild-west status means not having to guarantee that everything works. It could come with the academic guarantee, so to speak ('it worked for me once, but I give no promises after that'). Even having it in a central place, without compatibility with some new version, would be useful: someone could come along and decide to port it, as a way to contribute.

I think, though, that minimally there should at least be an index of personal Keras libraries, and it should be pointed at by Keras master (by a link or, more generously, a page in the docs). Right now, there is a loose set of repositories and gists that you have to know about in order to find them (for example, Seya, seq2seq, some VGG ports, a Keras branch that converts Caffe models to Keras models, attention mechanisms, etc.).

@udibr
Contributor Author

udibr commented May 12, 2016

@braingineer How about this wiki page https://github.com/fchollet/keras/wiki/Built-with-Keras

@braingineer
Contributor

That's awesome. I would consider it the minimum level, though. Also, I did not realize there was a wiki, so it could use some more signposting; it's not mentioned anywhere (I don't think).

I feel that something like the 'awesome' series (e.g. https://github.com/kjw0612/awesome-rnn) would be a better minimum, and there should/could be a blurb about it (a natural place would be near the mention of the examples folder, where the docs say "In the examples folder of the repository, you will find more advanced models: question-answering with memory networks, text generation with stacked LSTMs, etc.").

@pxlong

pxlong commented Jun 28, 2016

@barvinograd @udibr Hi, does this work now? Thanks!

@matthiasplappert
Contributor

Has this been updated for Keras 1.x? It'd be very useful for what I'm working on right now.

@anirudh9119

Has anyone tried this with sampling? I was unable to make this work in sampling mode. Any pointers?

@fchollet fchollet closed this Nov 29, 2016
@dgboy2000

@fchollet @braingineer @udibr @xingdi-eric-yuan What was the group decision on whether BN for RNN belongs in Keras?

@visionscaper

HOWEVER, I did not run any tests on LSTMBN; I closed this PR and I don't think it should be merged into keras/master.

@udibr Why was your final conclusion that your LSTMBN class should not be merged into Keras?

Further questions:

  • Your latest LSTMBN implementation is for Keras 1.x, correct?
  • What, in your opinion, should be done to make it Keras-'worthy'?

Thanks,

-- Freddy Snijder

@liangsun-ponyai

@udibr Have you tried this method? I have run some experiments, and using population statistics (training: False) at test time gives worse results than batch statistics. I don't know why.
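
For context, the two modes being compared, as a small NumPy sketch (an illustration of standard batch normalization at inference time, not this PR's code; names are illustrative):

```python
import numpy as np

def bn_inference(x, pop_mean, pop_var, gamma, beta, eps=1e-5,
                 use_batch_stats=False):
    # use_batch_stats=False is the usual test-time behaviour (population
    # statistics); True reproduces the training-time normalization that the
    # commenter found to work better in their experiments.
    if use_batch_stats:
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        mean, var = pop_mean, pop_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```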
