
Add example to compare RELU with SELU #6990

Merged
merged 13 commits into keras-team:master on Jun 16, 2017

Conversation

6 participants
@zafarali
Contributor

zafarali commented Jun 14, 2017

SELU was added in #6924.
This PR adds a script to compare RELU and SELU performance in an MLP.

image

@fchollet

Collaborator

fchollet commented Jun 14, 2017

So SELU is significantly worse than ReLU on this example?

@zafarali

Contributor

zafarali commented Jun 14, 2017

Yes, which is also what @bigsnarfdude sees in bigsnarfdude/SELU_Keras_Tutorial.

@tboquet

Contributor

tboquet commented Jun 14, 2017

In the paper they use more layers to show that a self-normalizing neural net can work better. I guess you should add more layers to fully test the claim in this work.

@zafarali

Contributor

zafarali commented Jun 14, 2017

So I repeated with:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

# max_words and num_classes are defined earlier in the script
model_selu = Sequential()
model_selu.add(Dense(512, input_shape=(max_words,)))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(num_classes))
model_selu.add(Activation('softmax'))

and I got no appreciable improvement:

image

@fchollet

Collaborator

fchollet commented Jun 14, 2017

In theory, normalization is only useful when you have very deep networks; it helps with gradient propagation. Try less dropout, more layers, and training longer. You could also use smaller layers but more of them.

What's the likelihood that we're dealing with an implementation bug?
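
For concreteness, a minimal sketch of that suggestion (more but smaller layers, less AlphaDropout, longer training), assuming max_words and num_classes are defined as in the Reuters example; the depth, width, and rate below are illustrative, not the values merged in this PR:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

model_selu = Sequential()
model_selu.add(Dense(64, input_shape=(max_words,),
                     kernel_initializer='lecun_normal'))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.1))
for _ in range(7):  # a deeper stack of narrower layers
    model_selu.add(Dense(64, kernel_initializer='lecun_normal'))
    model_selu.add(Activation('selu'))
    model_selu.add(AlphaDropout(0.1))
model_selu.add(Dense(num_classes))
model_selu.add(Activation('softmax'))
# and train for more epochs than in the original run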

@drauh

Contributor

drauh commented Jun 14, 2017

In the paper, the dropout rates are in the 0.05-0.1 range, inputs are standardized, the kernel initializer is lecun_uniform, and the optimizer is plain SGD. Maybe this can help...
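
For reference, a rough sketch of those settings in Keras; the use of StandardScaler and the exact rate are assumptions on my part, and x_train, x_test, max_words and num_classes are taken from the Reuters example:

from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout
from keras.optimizers import SGD

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # standardize inputs: zero mean, unit variance
x_test = scaler.transform(x_test)

model = Sequential()
model.add(Dense(512, input_shape=(max_words,), kernel_initializer='lecun_uniform'))
model.add(Activation('selu'))
model.add(AlphaDropout(0.05))  # rate in the 0.05-0.1 range from the paper
model.add(Dense(num_classes, kernel_initializer='lecun_uniform'))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),  # plain SGD, no momentum
              metrics=['accuracy'])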

@zafarali

Contributor

zafarali commented Jun 14, 2017

@zafarali

Contributor

zafarali commented Jun 14, 2017

The reduced size helped with the overfitting issue.

The reduced dropout and lecun_normal initializer helped it converge very fast and perform better than the regular MLP. Will make a few more modifications and paste a graph here.

@fchollet

Collaborator

fchollet commented Jun 14, 2017

Try the baseline network (relu) with both lecun_normal and the default glorot_uniform, so we can see the impact of the initializer alone.
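
A minimal sketch of that ablation, training the same ReLU baseline once per initializer; the layer size, epochs, and validation split below are placeholders, and x_train, y_train, max_words and num_classes come from the Reuters example:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

histories = {}
for init in ['glorot_uniform', 'lecun_normal']:
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words,), kernel_initializer=init))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, kernel_initializer=init))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    histories[init] = model.fit(x_train, y_train,
                                batch_size=32,
                                epochs=5,
                                validation_split=0.1)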

@bigsnarfdude

bigsnarfdude commented Jun 15, 2017

The reduced size helped with the overfitting issue.

The reduced dropout and lecun_normal initializer helped it converge very fast and perform better than the regular MLP. 

The latest code with: kernel_initializer='lecun_normal'

graph below

selu

@fchollet

Collaborator

fchollet commented Jun 15, 2017

Looks like the SELU version is overfitting pretty badly; it starts overfitting at epoch 1. Maybe increase dropout or reduce the layer size?

@zafarali

Contributor

zafarali commented Jun 15, 2017

Thanks @bigsnarfdude !

So, making side-by-side comparisons across some of the parameters (a hedged builder sketch follows this list):

Effect of initializer: glorot_uniform vs lecun_normal

4 dense layers, 16 units each
relu activation
dropout = 0.5
image

Effect of activation: relu vs selu

4 dense layers, 16 units each
dropout = 0.5
initializer=lecun_normal

image

Effect of dropout type: Dropout(0.5) vs AlphaDropout(0.5) with relu

4 dense layers, 16 units each
relu activation
initializer=lecun_normal

image

Effect of dropout type + rate: Dropout(0.5) vs AlphaDropout(0.05) with relu

4 dense layers, 16 units each
relu activation
initializer=lecun_normal

image

"Recommended" structure of SNNs: Dropout(0.5) vs AlphaDropout(0.05) with selu

4 dense layers, 16 units each
selu activation
initializer=lecun_normal

image

"Recommended" structure of SNNs: Dropout(0.5) vs AlphaDropout(0.1) with selu

4 dense layers, 16 units each
selu activation
initializer=lecun_normal

image

Effect of dropout rate on relu vs selu

4 dense layers, 16 units each
dropout = 0.5 for relu network
dropout = 0.1 for selu network
initializer=lecun_normal

image

Effect of Dropout type on selu networks

4 dense layers, 16 units each
selu activation function
initializer=lecun_normal

image

Effect of initializer: glorot_uniform vs lecun_normal

4 dense layers, 16 units each
selu activation
alphadropout = 0.1

image
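
As mentioned above, a hedged sketch of the kind of builder behind these comparisons; the function name and signature are illustrative rather than the code merged in this PR, and max_words and num_classes are as in the example:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.noise import AlphaDropout

def build_mlp(activation='relu', dropout_layer=Dropout, dropout_rate=0.5,
              kernel_initializer='glorot_uniform', n_dense=4, dense_units=16):
    model = Sequential()
    model.add(Dense(dense_units, input_shape=(max_words,),
                    kernel_initializer=kernel_initializer))
    model.add(Activation(activation))
    model.add(dropout_layer(dropout_rate))
    for _ in range(n_dense - 1):
        model.add(Dense(dense_units, kernel_initializer=kernel_initializer))
        model.add(Activation(activation))
        model.add(dropout_layer(dropout_rate))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    return model

# e.g. the "recommended" SNN configuration vs. the ReLU baseline:
snn = build_mlp('selu', AlphaDropout, 0.05, 'lecun_normal')
baseline = build_mlp('relu', Dropout, 0.5, 'glorot_uniform')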

@zafarali

Contributor

zafarali commented Jun 15, 2017

Graph for 222f613:

image

It does seem like these components synergize well together, and naive implementations are prone to dismissing SELU.

I still think the net is overfitting. What do you think @bigsnarfdude @fchollet?

@zafarali

Contributor

zafarali commented Jun 15, 2017

@fchollet I think there might be some value in making this a command-line executable script with default arguments (using argparse). This way, someone who wants to quickly compare two SELU vs. ReLU architectures can do:

python examples/reuters_mlp_with_selu.py -d 4 -h 16 -a1 relu -a2 selu -d1 dropout -d2 alphadropout -dr1 0.5 -dr2 0.05 -i1 glorot_uniform -i2 lecun_normal

It would also make the above graphs reproducible. Opinions?
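
A hedged sketch of what that interface could look like with argparse; the flag names are illustrative, and note that argparse reserves -h for --help, so the unit-count flag gets a long name here:

import argparse

parser = argparse.ArgumentParser(
    description='Compare two MLP configurations (e.g. ReLU vs. SELU) on Reuters.')
parser.add_argument('-d', '--n-dense', type=int, default=4,
                    help='number of hidden Dense layers per network')
parser.add_argument('--dense-units', type=int, default=16,
                    help='units per hidden Dense layer')
parser.add_argument('-a1', default='relu', help='activation for network 1')
parser.add_argument('-a2', default='selu', help='activation for network 2')
parser.add_argument('-dr1', type=float, default=0.5, help='dropout rate for network 1')
parser.add_argument('-dr2', type=float, default=0.05, help='dropout rate for network 2')
parser.add_argument('-i1', default='glorot_uniform', help='initializer for network 1')
parser.add_argument('-i2', default='lecun_normal', help='initializer for network 2')
args = parser.parse_args()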

@fchollet

Collaborator

fchollet commented Jun 15, 2017

Certainly, it is good to fully parameterize the models so that different configurations can be run by changing only one variable. But the advantages of command-line arguments over just editing global variables at the beginning of the file are slim. I'd recommend just having a list of global parameters with reasonable defaults.

We could even make layer depth a configurable parameter.
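
For comparison with the CLI idea above, a short sketch of the global-parameter style being recommended here, mirroring the hyperparameter dicts printed later in this thread; the exact names and defaults are illustrative:

from keras.layers import Dropout
from keras.layers.noise import AlphaDropout

max_words = 1000
batch_size = 32
epochs = 40

# edit these two dicts to run a different comparison
network1 = {
    'n_dense': 4, 'dense_units': 16, 'activation': 'relu',
    'dropout': Dropout, 'dropout_rate': 0.5,
    'kernel_initializer': 'glorot_uniform', 'optimizer': 'sgd',
}
network2 = {
    'n_dense': 4, 'dense_units': 16, 'activation': 'selu',
    'dropout': AlphaDropout, 'dropout_rate': 0.1,
    'kernel_initializer': 'lecun_normal', 'optimizer': 'sgd',
}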

@bigsnarfdude

bigsnarfdude commented Jun 15, 2017

@zafarali

RE: Graph for 222f613

The SELU loss with kernel_initializer='lecun_normal' appears more consistent with the results I found using the TF code provided with the paper. The SELU function definitely requires AlphaDropout and the kernel_initializer to find the "magic".

@zafarali

Contributor

zafarali commented Jun 15, 2017

Commit 86e16ff:

image

Network 1 results
Hyperparameters: {'dropout': <class 'keras.layers.core.Dropout'>, 'kernel_initializer': 'glorot_uniform', 'dropout_rate': 0.5, 'n_dense': 6, 'dense_units': 16, 'activation': 'relu', 'optimizer': 'adam'}
Test score: 1.93495889508
Test accuracy: 0.567230632235
Network 2 results
Hyperparameters: {'dropout': <class 'keras.layers.noise.AlphaDropout'>, 'kernel_initializer': 'lecun_normal', 'dropout_rate': 0.1, 'n_dense': 6, 'dense_units': 16, 'activation': 'selu', 'optimizer': 'sgd'}
Test score: 1.75557634412
Test accuracy: 0.614425645645
@tboquet

Now this makes sense! I think most of the examples export graphs as PNGs; could you also follow this convention?

'dropout': AlphaDropout,
'dropout_rate': 0.1,
'kernel_initializer': 'lecun_normal',
'optimizer': 'sgd'

@fchollet

fchollet Jun 15, 2017

Collaborator

Not sure how fair the comparison is, if using a different optimizer and different dropout rate...

@zafarali

zafarali Jun 15, 2017

Contributor

I can set the optimizer to sgd in both.

What do you recommend we do with dropout? To be fair, they are not comparable 1-to-1 anyway (i.e. Dropout(0.5) is not equivalent to AlphaDropout(0.5)).
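
A quick, hedged way to see why the two are not interchangeable: AlphaDropout is designed to preserve the mean and variance of roughly unit-variance SELU activations, while plain Dropout at the same rate inflates the variance. The setup below is illustrative:

import numpy as np
from keras.layers import Input, Dropout
from keras.layers.noise import AlphaDropout
from keras.models import Model

x = np.random.randn(10000, 64).astype('float32')  # ~ zero mean, unit variance

inp = Input(shape=(64,))
drop_model = Model(inp, Dropout(0.5)(inp, training=True))
alpha_model = Model(inp, AlphaDropout(0.5)(inp, training=True))

for name, m in [('Dropout(0.5)', drop_model), ('AlphaDropout(0.5)', alpha_model)]:
    y = m.predict(x)
    print(name, 'mean=%.3f var=%.3f' % (y.mean(), y.var()))
# Dropout(0.5) keeps the mean but roughly doubles the variance, while
# AlphaDropout keeps both near 0 and 1, so matching the rates 1-to-1 does not
# give matched regularization strength.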

@bigsnarfdude

bigsnarfdude Jun 15, 2017

They performed a grid search, so I'm wondering if it's possible to do apples-to-apples comparisons, as it looks like we are hand-tuning the SELU/SNN parameters.

Best performing SNNs have 8 layers, compared to the runner-ups ReLU networks with layer normalization with 2 and 3 layers ... we preferred settings with a higher number of layers, lower learning rates and higher dropout rates

@zafarali

zafarali Jun 16, 2017

Contributor

Changed to sgd

@@ -0,0 +1,153 @@
'''

@fchollet

fchollet Jun 16, 2017

Collaborator

Please start file with a one-line description ending with a period.

kernel_initializer: str. the initializer for the weights
optimizer: str/keras.optimizers.Optimizer. the optimizer to use
num_classes: int > 0. the number of classes to predict
max_words: int > 0. the maximum number of words per data point

@fchollet

fchollet Jun 16, 2017

Collaborator

Docstring needs a # Returns section
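
For illustration, a docstring in the style being requested (one-line summary ending with a period, plus # Arguments and # Returns); the signature, defaults, and body here are placeholders rather than the merged code:

from keras.models import Sequential
from keras.layers import Dense

def create_network(kernel_initializer='lecun_normal', optimizer='sgd',
                   num_classes=46, max_words=1000):
    """Generic function to create a fully-connected neural network.

    # Arguments
        kernel_initializer: str. The initializer for the weights.
        optimizer: str or keras.optimizers.Optimizer. The optimizer to use.
        num_classes: int > 0. The number of classes to predict.
        max_words: int > 0. The maximum number of words per data point.

    # Returns
        A compiled Keras Sequential model.
    """
    model = Sequential()
    model.add(Dense(num_classes, input_shape=(max_words,),
                    kernel_initializer=kernel_initializer,
                    activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model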

optimizer='adam',
num_classes=1,
max_words=max_words):
"""Generic function to create a fully connect neural network

@fchollet

fchollet Jun 16, 2017

Collaborator

"fully-connected". One-line description must end with a period.

score_model1 = model1.evaluate(x_test, y_test,
batch_size=batch_size, verbose=1)

@fchollet

fchollet Jun 16, 2017

Collaborator

One line per keyword argument (to avoid overly long lines), same below
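
That is, formatted with one keyword argument per line (same variable names as in the snippet above):

score_model1 = model1.evaluate(x_test, y_test,
                               batch_size=batch_size,
                               verbose=1)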

zafarali and others added some commits Jun 16, 2017

@fchollet fchollet merged commit 8d5b2ce into keras-team:master Jun 16, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@antonmbk

Contributor

antonmbk commented Jul 15, 2017

@zafarali , @bigsnarfdude
Not sure if anyone will see this comment, but I'm curious: why wasn't use_bias=False set on all but the last Dense layer in the selu network? Isn't the bias unnecessary due to the self-normalization? I may be mistaken, but wanted to ask if anyone knew the answer.
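
For reference, a hedged sketch of the variant the question describes, with the bias removed from the hidden Dense layers that feed SELU; the merged example keeps the default use_bias=True, whether dropping it helps is exactly what is being asked, and max_words and num_classes are as in the example:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

model = Sequential()
model.add(Dense(16, input_shape=(max_words,), use_bias=False,
                kernel_initializer='lecun_normal'))
model.add(Activation('selu'))
model.add(AlphaDropout(0.1))
model.add(Dense(16, use_bias=False, kernel_initializer='lecun_normal'))
model.add(Activation('selu'))
model.add(AlphaDropout(0.1))
model.add(Dense(num_classes))  # the output layer keeps its bias
model.add(Activation('softmax'))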
