
BN Questions (old) #1802

Closed · jmhessel opened this issue Feb 23, 2016 · 15 comments
@jmhessel
Contributor

Hey All,

I posted this in keras-users but haven't gotten any replies, so I'll raise it here: I've been doing some thinking about the Keras implementation of Batch Normalization.

The problems with the current implementation are threefold:

A) A batch normalization layer cancels the bias term of the previous layer (i.e. BN(Wx + b) = BN(Wx)), so that bias only wastes memory and computation. Because all batch-normalized activations are shifted by the mean over the batch, any bias added by the previous layer is entirely absorbed into that mean. In batch-normalized networks, you don't need bias terms in the layers feeding into BN.

model.add(Dense(100, activation = 'linear'))
model.add(BatchNormalization())
model.add(Activation('relu'))

is currently wrong, because the Dense layer carries an extra set of bias parameters that do nothing.
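
As a quick sanity check of that cancellation claim (a plain NumPy sketch, not Keras code): batch-normalizing a + b gives exactly the same output as batch-normalizing a for any constant per-feature shift b, because the batch mean absorbs the shift.

import numpy as np

def batch_norm(x, eps=1e-5):
    # The normalization step of BN at training time, per feature (column) over the
    # batch (rows); gamma/beta are omitted since they don't affect the comparison.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

a = np.random.randn(64, 100)  # pre-activations Wx for a batch of 64
b = np.random.randn(100)      # a per-feature bias vector
print(np.allclose(batch_norm(a), batch_norm(a + b)))  # True: the bias is absorbed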

B) (Retracted.) The current implementation is correct on this point; I was mistaken.

C) The API makes it somewhat easy to apply BN incorrectly. BN should be applied before the activation, which means that

model.add(Dense(100, activation = 'relu'))
model.add(BatchNormalization())

is wrong: besides the extraneous bias term from point A, it applies BN after the activation rather than before it.

Because it seems BN is here to stay, so to speak, it might be worth re-thinking the batch normalization layer. Having it as its own layer is hard because its behavior should depend on the type of layer before it. There are three possible routes.

One would be a parameter in the Model class that, when enabled, adds batch normalization to each layer and removes all biases from the network. This might look like:

## A correct model under this proposal
model = Sequential(batch_norm=True)
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100)))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))
...

This would make the API clean, but it would also require some reworking of layers (i.e. if batch_norm mode were on, the layers' parameters would not include bias terms).

Another option would be to include it as a property of layers themselves.

## A correct model under this proposal
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100),
                        no_bias=True, batch_norm=True))
model.add(Flatten())
model.add(Dense(100, no_bias=True, batch_norm=True, activation='relu'))
model.add(Dense(500, no_bias=True, batch_norm=True, activation='relu'))
model.add(Dense(10, batch_norm=True, activation='softmax'))
...

This is less clean, but would require less implementation.

The last option is to keep it as its own layer and have it inspect the preceding layer to figure out what it should be doing, along with warnings when the previous layer has a bias or activation:

## A correct model under this proposal
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100), no_bias=True))
model.add(BatchNormalization())  # would perform the correct convolutional batch normalization
model.add(Activation('relu'))    # no warning raised: the activation comes after BN, as intended
model.add(Flatten())
model.add(Dense(100, no_bias=True))
model.add(BatchNormalization())
model.add(Activation('relu'))
...

This is the least clean API-wise, but the easiest to implement, because everything is contained in the BatchNormalization layer itself.

Thoughts? I'd be happy to work on this, but I don't know which direction would be best.

@jmhessel jmhessel changed the title from "Batch Normalization is Wrong" to "Batch Normalization is Wrong?" on Feb 23, 2016
@fchollet
Member

Point A is "the fact that biases are bundled with Dense layers adds a negligible amount of overhead in situations where I don't need biases". Well, it's trivial to roll out a Dense layer with no biases if you need that.

Point B is 100% factually wrong.

Point C is wrong for several reasons: 1) because it's usually preferable to apply BN after relu, and 2) because the more common way to apply relu in Keras is via Activation('relu').

Overall: we could consider adding a keyword argument in Dense (and other layers) to make biases optional. But I don't think that is all that useful or necessary.
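
For what it's worth, this keyword did eventually appear: in the Keras 2 API (which postdates this thread) the bias is switched off with use_bias=False, so the pattern argued for above can be written directly. A minimal sketch assuming the Keras 2 API:

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(100, use_bias=False, input_shape=(20,)))  # no bias; BN's beta provides the shift
model.add(BatchNormalization())
model.add(Activation('relu'))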

@jmhessel
Contributor Author

Apologies for misreading the code for B -- I guess I don't have the best understanding of how it's implemented. When you say it's factually wrong, do you mean that you shouldn't have a single gamma/beta term for each feature map, and that it's better to do things on a per-activation basis? Or that the current implementation already has this?

Indeed, the computational and memory requirements for a bias term are only linear in the layer size. It would be relatively simple to add a dense layer without bias, sure. If the direction of Keras is agnostic to choices like this, fine -- I just thought I'd bring it up.

For point C -- perhaps I misunderstood the paper, but in section 3.2 they suggest doing g(BN(Wu)) where g is the activation function. Perhaps I am misunderstanding something?

@jmhessel
Contributor Author

Ahh -- apologies for point B. I am definitely mistaken.

@fchollet
Member

Keras BN normalizes per feature. The axis on which the features live is configurable: it defaults to -1, and with convolutions under Theano conventions (channels first) it needs to be set to 1.
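
To make the axis point concrete, a short sketch in the thread-era Keras 1 API (the (3, 100, 100) channels-first input shape is just a placeholder):

from keras.models import Sequential
from keras.layers.convolutional import Convolution2D
from keras.layers.normalization import BatchNormalization

model = Sequential()
# Theano-style (channels, rows, cols) input: the feature/channel axis is 1.
model.add(Convolution2D(32, 3, 3, border_mode='same', input_shape=(3, 100, 100)))
model.add(BatchNormalization(axis=1))  # one gamma/beta pair per channel
# With TensorFlow-style (rows, cols, channels) tensors, the default axis=-1 is the channel axis.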

For point C -- perhaps I misunderstood the paper, but in section 3.2 they suggest doing g(BN(Wu)) where g is the activation function. Perhaps I am misunderstanding something?

I haven't gone back to check what they are suggesting in their original paper, but I can guarantee that recent code written by Christian applies relu before BN. It is still occasionally a topic of debate, though.

@jmhessel
Contributor Author

Fair enough. If the linear memory/computation overhead isn't worth addressing, and if Christian is applying relu before BN (it doesn't really matter, because Keras supports both, though IMO the API biases users towards doing the opposite of the paper), then all is well. Again, sorry for point B.

@chbrian

chbrian commented Jun 15, 2016

Here are some evaluations of BatchNormalization. It seems it's better to put BatchNormalization after ReLU:
https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

@jmhessel jmhessel changed the title from "Batch Normalization is Wrong?" to "BN Questions (old)" on Sep 22, 2016
@mlzxy

mlzxy commented Nov 24, 2016

@jmhessel I do think the original batch normalization paper says you should apply BN after the activation, not before. Well, in practice it becomes an experimental thing.

@jmhessel
Contributor Author

jmhessel commented Nov 24, 2016

@benbbear From the original paper:

We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

So, the original paper did say you should do it before the activation, hence my original concern with the keras API. I guess this part proved to be less important, though. I thought it might be problematic but didn't do the experiments myself.
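
Spelling out the transform the quote refers to: the paper computes y = g(BN(Wu + b)), where for each feature k, BN(x_k) = gamma_k * (x_k - mu_B) / sqrt(sigma_B^2 + eps) + beta_k, with mu_B and sigma_B^2 the per-feature mean and variance over the mini-batch B. Since mu_B absorbs any constant b, the bias is exactly the parameter point A above calls redundant, with beta_k taking over its role.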

@mlzxy

mlzxy commented Nov 24, 2016

@jmhessel

Just reviewed it again.

Yes I am indeed wrong. Thanks so much for your correction!

@martinbel

The original paper mentions this approach, and so does He's "plain net" implementation. In He's words from the 2015 ResNet paper: "We adopt batch normalization right after each convolution and before activation, following [16]."

A simple example:

model.add(Convolution2D(64, 3, 3, border_mode='same', init=init))
model.add(BatchNormalization(axis=1)) # axis=1 for Theano (channels first), axis=-1 for TensorFlow (channels last)
model.add(Activation('relu')) # Could be any other activation

@jmhessel
Contributor Author

jmhessel commented Nov 25, 2016

@martinbel Actually, that is how the ResNet50 in Keras is implemented (in the applications folder). Not that any of this really matters to Keras specifically; both orders are possible, and the only practical difference is that you can't use the activation=... shortcut in the layer construction if you are applying BN in the originally intended order. Though I am not 100% convinced that either order is correct, I am convinced that the difference in performance is marginal and/or just another hyperparameter.
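
To make the shortcut point explicit, a sketch in the thread-era Keras 1 API (input_dim=20 is just a placeholder):

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.normalization import BatchNormalization

# BN after the activation: the activation= shortcut can be used.
post_act = Sequential()
post_act.add(Dense(100, activation='relu', input_dim=20))
post_act.add(BatchNormalization())

# BN before the activation (the paper's order): the shortcut can't be used,
# so the activation goes in its own layer.
pre_act = Sequential()
pre_act.add(Dense(100, input_dim=20))
pre_act.add(BatchNormalization())
pre_act.add(Activation('relu'))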

@martinbel

martinbel commented Nov 25, 2016

@jmhessel I added this just to help others looking for a simple, concrete answer on how to implement batch normalization in Keras the way it is described in the paper. The ResNet paper quote refers to how He implements the "plain nets", not only ResNets. I agree you can't use the activation=... shortcut for doing this.

@litingsjj

@martinbel what do you mean by "# axis=1 for Theano (channels first), axis=-1 for TensorFlow (channels last)"?

The keras.io docs say: "axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization."
Is it really different for Theano and TensorFlow?

@litingsjj

When I use BatchNormalization() before every activation (one per Conv layer), one epoch takes 3656s. Without BatchNormalization() it takes 216s. Is BatchNormalization() really needed before every activation, or how should I choose where to apply it?

@nouiz
Contributor

nouiz commented Apr 12, 2017 via email

facebook-github-bot pushed a commit to facebookresearch/Pearl that referenced this issue Sep 21, 2023
Summary:
Change the order of BN with respect to other modules in a layer.
Based on this reference: keras-team/keras#1802 (comment)

Differential Revision: D49447550

fbshipit-source-id: aa03c18de8a566e5d07659b28459b5d8ecb4c881