
BN Questions (old) #1802

Closed · jmhessel opened this issue Feb 23, 2016 · 15 comments
@jmhessel
Contributor

Hey All,

I posted this in keras-users but haven't gotten any replies, so I'll raise it here: I've been doing some thinking about the Keras implementation of Batch Normalization.

The problems with the current implementation are threefold:

A) A batch normalization layer cancels the bias term of the previous layer (i.e. BN(Wx + b) = BN(Wx)), so that bias only wastes memory and computation. Because all batch-normalized activations are shifted by the mean over the batch, any bias added by the previous layer is entirely absorbed into that mean. In batch-normalized networks, you don't need bias terms in the layers feeding into BN.

model.add(Dense(100, activation = 'linear'))
model.add(BatchNormalization())
model.add(Activation('relu'))

is currently wrong, because the Dense layer carries an extra set of bias parameters that do nothing.
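
As a quick sanity check of that cancellation claim (a plain NumPy sketch, not Keras code): batch-normalizing a + b gives exactly the same output as batch-normalizing a for any constant per-feature shift b, because the batch mean absorbs the shift.

import numpy as np

def batch_norm(x, eps=1e-5):
    # The normalization step of BN at training time, per feature (column) over the
    # batch (rows); gamma/beta are omitted since they don't affect the comparison.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

a = np.random.randn(64, 100)  # pre-activations Wx for a batch of 64
b = np.random.randn(100)      # a per-feature bias vector
print(np.allclose(batch_norm(a), batch_norm(a + b)))  # True: the bias is absorbed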

B) (Retracted.) The current implementation is correct on this point; I was mistaken.

C) The API makes it somewhat easy to apply BN incorrectly. BN should be applied before the activation, which means that

model.add(Dense(100, activation = 'relu'))
model.add(BatchNormalization())

is wrong: besides the extraneous bias term from point A, it applies BN after the activation rather than before it.

Because it seems BN is here to stay, so to speak, it might be worth re-thinking the batch normalization layer. Having it as its own layer is hard because its behavior should depend on the type of layer before it. There are three possible routes.

One would be a parameter in the Model class that, when enabled, adds batch normalization to each layer and removes all biases from the network. This might look like:

## A correct model under this proposal
model = Sequential(batch_norm=True)
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100)))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))
...

This would make the API clean, but it would also require some reworking of layers (i.e. if batch_norm mode were on, the layers' parameters would not include bias terms).

Another option would be to include it as a property of layers themselves.

## A correct model under this proposal
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100),
                        no_bias=True, batch_norm=True))
model.add(Flatten())
model.add(Dense(100, no_bias=True, batch_norm=True, activation='relu'))
model.add(Dense(500, no_bias=True, batch_norm=True, activation='relu'))
model.add(Dense(10, batch_norm=True, activation='softmax'))
...

This is less clean, but would require less implementation.

The last option is to keep it as its own layer and have it inspect the preceding layer to figure out what it should be doing, along with warnings when the previous layer has a bias or activation:

## A correct model under this proposal
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100), no_bias=True))
model.add(BatchNormalization())  # would perform the correct convolutional batch normalization
model.add(Activation('relu'))    # no warning raised: the activation comes after BN, as intended
model.add(Flatten())
model.add(Dense(100, no_bias=True))
model.add(BatchNormalization())
model.add(Activation('relu'))
...

This is the least clean API-wise, but the easiest to implement, because everything is contained in the BatchNormalization layer itself.

Thoughts? I'd be happy to work on this, but I don't know which direction would be best.

@jmhessel jmhessel changed the title from "Batch Normalization is Wrong" to "Batch Normalization is Wrong?" on Feb 23, 2016
@fchollet
Member

Point A is "the fact that biases are bundled with Dense layers adds a negligible amount of overhead in situations where I don't need biases". Well, it's trivial to roll out a Dense layer with no biases if you need that.

Point B is 100% factually wrong.

Point C is wrong for several reasons: 1) because it's usually preferable to apply BN after relu, and 2) because the more common way to apply relu in Keras is via Activation('relu').

Overall: we could consider adding a keyword argument in Dense (and other layers) to make biases optional. But I don't think that is all that useful or necessary.
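
For what it's worth, this keyword did eventually appear: in the Keras 2 API (which postdates this thread) the bias is switched off with use_bias=False, so the pattern argued for above can be written directly. A minimal sketch assuming the Keras 2 API:

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(100, use_bias=False, input_shape=(20,)))  # no bias; BN's beta provides the shift
model.add(BatchNormalization())
model.add(Activation('relu'))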

@jmhessel
Contributor Author

Apologies for misreading the code for B -- I guess I don't have the best understanding of how it's implemented. When you say it's factually wrong, do you mean that you shouldn't have a single gamma/beta term for each feature map, and that it's better to do things on a per-activation basis? Or that the current implementation already has this?

Indeed, the computational and memory requirements for a bias term are only linear in the layer size. It would be relatively simple to add a dense layer without bias, sure. If the direction of Keras is agnostic to choices like this, fine -- I just thought I'd bring it up.

For point C -- perhaps I misunderstood the paper, but in section 3.2 they suggest doing g(BN(Wu)) where g is the activation function. Perhaps I am misunderstanding something?

@jmhessel
Contributor Author

Ahh -- apologies for point B. I am definitely mistaken.

@fchollet
Member

Keras BN normalizes per feature. The axis on which the features live is configurable: it defaults to -1, and with convolutions under Theano conventions (channels first) it needs to be set to 1.
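
To make the axis point concrete, a short sketch in the thread-era Keras 1 API (the (3, 100, 100) channels-first input shape is just a placeholder):

from keras.models import Sequential
from keras.layers.convolutional import Convolution2D
from keras.layers.normalization import BatchNormalization

model = Sequential()
# Theano-style (channels, rows, cols) input: the feature/channel axis is 1.
model.add(Convolution2D(32, 3, 3, border_mode='same', input_shape=(3, 100, 100)))
model.add(BatchNormalization(axis=1))  # one gamma/beta pair per channel
# With TensorFlow-style (rows, cols, channels) tensors, the default axis=-1 is the channel axis.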

For point C -- perhaps I misunderstood the paper, but in section 3.2 they suggest doing g(BN(Wu)) where g is the activation function. Perhaps I am misunderstanding something?

I haven't gone back to check what they are suggesting in their original paper, but I can guarantee that recent code written by Christian applies relu before BN. It is still occasionally a topic of debate, though.

@jmhessel
Contributor Author

Fair enough. If the linear memory/computation overhead isn't worth addressing, and if Christian is applying relu before BN (it doesn't really matter, because Keras supports both, though IMO the API biases users towards doing the opposite of the paper), then all is well. Again, sorry for point B.

@chbrian

chbrian commented Jun 15, 2016

Here are some evaluations of BatchNormalization. It seems it's better to put BatchNormalization after ReLU:
https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

@jmhessel jmhessel changed the title from "Batch Normalization is Wrong?" to "BN Questions (old)" on Sep 22, 2016
@mlzxy

mlzxy commented Nov 24, 2016

@jmhessel I do think the original batch normalization paper says you should apply BN after the activation, not before. Well, in practice it becomes an experimental thing.

@jmhessel
Contributor Author

jmhessel commented Nov 24, 2016

@benbbear From the original paper:

We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

So, the original paper did say you should do it before the activation, hence my original concern with the keras API. I guess this part proved to be less important, though. I thought it might be problematic but didn't do the experiments myself.
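
Spelling out the transform the quote refers to: the paper computes y = g(BN(Wu + b)), where for each feature k, BN(x_k) = gamma_k * (x_k - mu_B) / sqrt(sigma_B^2 + eps) + beta_k, with mu_B and sigma_B^2 the per-feature mean and variance over the mini-batch B. Since mu_B absorbs any constant b, the bias is exactly the parameter point A above calls redundant, with beta_k taking over its role.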

@mlzxy

mlzxy commented Nov 24, 2016

@jmhessel

Just reviewed it again.

Yes I am indeed wrong. Thanks so much for your correction!

@martinbel

The original paper mentions this approach, and so does He's "plain net" implementation. In He's words from the 2015 ResNet paper: "We adopt batch normalization right after each convolution and before activation, following [16]."

A simple example:

model.add(Convolution2D(64, 3, 3, border_mode='same', init=init))
model.add(BatchNormalization(axis=1)) # axis=1 for Theano (channels first), axis=-1 for TensorFlow (channels last)
model.add(Activation('relu')) # Could be any other activation

@jmhessel
Contributor Author

jmhessel commented Nov 25, 2016

@martinbel Actually, that is how the ResNet50 in Keras is implemented (in the applications folder). Not that any of this really matters to Keras specifically; both orders are possible, and the only practical difference is that you can't use the activation=... shortcut in the layer construction if you are applying BN in the originally intended order. Though I am not 100% convinced that either order is correct, I am convinced that the difference in performance is marginal and/or just another hyperparameter.
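
To make the shortcut point explicit, a sketch in the thread-era Keras 1 API (input_dim=20 is just a placeholder):

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.normalization import BatchNormalization

# BN after the activation: the activation= shortcut can be used.
post_act = Sequential()
post_act.add(Dense(100, activation='relu', input_dim=20))
post_act.add(BatchNormalization())

# BN before the activation (the paper's order): the shortcut can't be used,
# so the activation goes in its own layer.
pre_act = Sequential()
pre_act.add(Dense(100, input_dim=20))
pre_act.add(BatchNormalization())
pre_act.add(Activation('relu'))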

@martinbel

martinbel commented Nov 25, 2016

@jmhessel I added this just to help others looking for a simple, concrete answer on how to implement batch normalization in Keras the way it is described in the paper. The ResNet paper quote refers to how He implements the "plain nets", not only ResNets. I agree you can't use the activation=... shortcut for doing this.

@litingsjj

@martinbel what do you mean by "# axis=1 for Theano (channels first), axis=-1 for TensorFlow (channels last)"?

The keras.io docs say: "axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization."
Is it really different for Theano and TensorFlow?

@litingsjj

When I use BatchNormalization() before every activation (one per Conv layer), one epoch takes 3656s. Without BatchNormalization() it takes 216s. Is BatchNormalization() really needed before every activation, or how should I choose where to apply it?

@nouiz
Contributor

nouiz commented Apr 12, 2017 via email

facebook-github-bot pushed a commit to facebookresearch/Pearl that referenced this issue Sep 21, 2023
Summary:
Change the order of BN with respect to other modules in a layer.
Based on this reference: keras-team/keras#1802 (comment)

Differential Revision: D49447550

fbshipit-source-id: aa03c18de8a566e5d07659b28459b5d8ecb4c881