### Intro to Keras: Examples of different model architectures and parameter settings

In [1]:
import pickle
import time
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD

Using Theano backend.


In [2]:
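def load_pickle(path):
    # Open in binary mode and close the file handle as soon as loading finishes
    with open(path, 'rb') as f:
        return pickle.load(f)

train_x = load_pickle("MNIST_train_x.pkl")
train_y = load_pickle("MNIST_train_y.pkl")
test_x = load_pickle("MNIST_test_x.pkl")
test_y = load_pickle("MNIST_test_y.pkl")
# Use the first 20,000 training samples to keep the experiments fast
train_x_short = train_x[:20000]
train_y_short = train_y[:20000]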

### Quadratic cost (mean squared error) vs. categorical cross-entropy
- Categorical cross-entropy significantly speeds up training: in the runs below it reaches 0.74 validation accuracy after the first epoch, whereas mean squared error reaches only 0.47 after 10 epochs
- Softmax output layers are the most appropriate for the MNIST problem since each image can only belong to one class and softmax outputs a probability distribution across the 10 classes.
    - As the value of one output node increases, the value of one or more other output nodes must decrease
    - This is consistent with our intuition that as we become more confident that an image belongs to one class, we reduce our confidence that it belongs to the other classes
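
One way to see both points is to compute the softmax by hand and compare the gradient each loss sends back to the output layer's pre-activations. The sketch below uses only the standard library and is independent of Keras; the function names and the example logits are illustrative assumptions, not part of the notebook.

```python
import math

def softmax(logits):
    """Return a probability distribution over classes for a list of logits."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_cross_entropy(logits, target):
    """Gradient of categorical cross-entropy w.r.t. the logits: p - y."""
    p = softmax(logits)
    return [pi - yi for pi, yi in zip(p, target)]

def grad_mse(logits, target):
    """Gradient of mean squared error w.r.t. the logits, through the softmax."""
    p = softmax(logits)
    n = len(p)
    # dL/dp_i for L = mean((p - y)^2)
    dl_dp = [2.0 * (pi - yi) / n for pi, yi in zip(p, target)]
    # Softmax Jacobian: dp_i/dz_j = p_i * (delta_ij - p_j)
    inner = sum(d * pi for d, pi in zip(dl_dp, p))
    return [pj * (dl_dp[j] - inner) for j, pj in enumerate(p)]

# Raising one logit lowers the probabilities of the other classes
print(softmax([1.0, 1.0, 1.0]))
print(softmax([3.0, 1.0, 1.0]))

# A confidently wrong prediction: the network favours class 0, the label is class 1
logits = [5.0, 0.0, 0.0]
target = [0.0, 1.0, 0.0]
ce = grad_cross_entropy(logits, target)
mse = grad_mse(logits, target)
print("cross-entropy gradient:", ce)
print("mse gradient:          ", mse)
```

For this hypothetical example the cross-entropy gradient on the true class is far larger than the mean squared error gradient, because the softmax Jacobian shrinks the error signal when the outputs are saturated. That is one plausible reason the mean squared error run learns more slowly.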

In [3]:
# Softmax output layer, mse
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model.compile(optimizer=sgd, loss='mse', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=128, nb_epoch=10, validation_split=0.2, verbose=2)
print()

# Softmax output layer, categorical crossentropy
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=128, nb_epoch=10, validation_split=0.2, verbose=2)

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
0s - loss: 0.0943 - acc: 0.1286 - val_loss: 0.0920 - val_acc: 0.1608
Epoch 2/10
0s - loss: 0.0907 - acc: 0.1831 - val_loss: 0.0893 - val_acc: 0.2165
Epoch 3/10
0s - loss: 0.0879 - acc: 0.2344 - val_loss: 0.0868 - val_acc: 0.2627
Epoch 4/10
0s - loss: 0.0854 - acc: 0.2835 - val_loss: 0.0843 - val_acc: 0.3023
Epoch 5/10
0s - loss: 0.0828 - acc: 0.3231 - val_loss: 0.0819 - val_acc: 0.3392
Epoch 6/10
0s - loss: 0.0803 - acc: 0.3548 - val_loss: 0.0796 - val_acc: 0.3690
Epoch 7/10
0s - loss: 0.0780 - acc: 0.3823 - val_loss: 0.0775 - val_acc: 0.3977
Epoch 8/10
0s - loss: 0.0758 - acc: 0.4104 - val_loss: 0.0754 - val_acc: 0.4270
Epoch 9/10
0s - loss: 0.0738 - acc: 0.4395 - val_loss: 0.0735 - val_acc: 0.4500
Epoch 10/10
0s - loss: 0.0719 - acc: 0.4669 - val_loss: 0.0718 - val_acc: 0.4730

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
0s - loss: 1.4947 - acc: 0.5833 - val_loss: 1.0786 - val_acc: 0.7435
Epoch 2/10
0s - loss

<keras.callbacks.History at 0x113f0a630>

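### Reducing batch_size tends to increase the amount learnt per epoch, but also increases the time to complete an epoch
- In the experiments below, the total time to reach a comparable accuracy level (about 0.90 validation accuracy) was broadly similar, between roughly 5 and 7 seconds
- Reducing the batch size from 32 to 16 appeared to hurt performance: over the same 6 epochs the run took longer (6.9 s vs. 5.4 s) and ended with slightly lower validation accuracy (0.8965 vs. 0.9005)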
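
A smaller batch size means more weight updates per epoch, which is one way to read "more learnt per epoch". The following sketch counts the updates for the four configurations in this section. It assumes 16,000 training samples (20,000 minus the 20% validation split) and one SGD update per batch; `updates_per_epoch` is a hypothetical helper, not a Keras function.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of SGD weight updates in one pass over the data (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)

N_TRAIN = 16000  # 20,000 samples minus the 20% held out by validation_split=0.2

# (batch_size, epochs) pairs used in the four timed experiments
runs = [(128, 10), (64, 7), (32, 6), (16, 6)]

for batch_size, epochs in runs:
    per_epoch = updates_per_epoch(N_TRAIN, batch_size)
    print("batch_size={:>3}: {:>4} updates/epoch, {:>4} updates in {} epochs".format(
        batch_size, per_epoch, per_epoch * epochs, epochs))
```

Each halving of the batch size doubles the number of updates per epoch, but every update is computed from fewer samples, so the gradient estimate is noisier and less of the work is vectorised.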

In [4]:
start = time.time()
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=128, nb_epoch=10, validation_split=0.2, verbose=2)
end = time.time()
print("Model took  {} seconds to complete".format(end - start))

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
0s - loss: 1.5686 - acc: 0.5379 - val_loss: 1.1132 - val_acc: 0.7535
Epoch 2/10
0s - loss: 0.9154 - acc: 0.7976 - val_loss: 0.8008 - val_acc: 0.8227
Epoch 3/10
0s - loss: 0.7039 - acc: 0.8436 - val_loss: 0.6599 - val_acc: 0.8562
Epoch 4/10
0s - loss: 0.5978 - acc: 0.8657 - val_loss: 0.5771 - val_acc: 0.8685
Epoch 5/10
0s - loss: 0.5285 - acc: 0.8774 - val_loss: 0.5265 - val_acc: 0.8758
Epoch 6/10
0s - loss: 0.4787 - acc: 0.8893 - val_loss: 0.4821 - val_acc: 0.8858
Epoch 7/10
0s - loss: 0.4441 - acc: 0.8946 - val_loss: 0.4598 - val_acc: 0.8878
Epoch 8/10
0s - loss: 0.4153 - acc: 0.9023 - val_loss: 0.4347 - val_acc: 0.8898
Epoch 9/10
0s - loss: 0.3910 - acc: 0.9059 - val_loss: 0.4121 - val_acc: 0.8940
Epoch 10/10
0s - loss: 0.3705 - acc: 0.9123 - val_loss: 0.3990 - val_acc: 0.8978
Model took  6.8437159061431885 seconds to complete


In [5]:
start = time.time()
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=64, nb_epoch=7, validation_split=0.2, verbose=2)
end = time.time()
print("Model took  {} seconds to complete".format(end - start))

Train on 16000 samples, validate on 4000 samples
Epoch 1/7
0s - loss: 1.2380 - acc: 0.6774 - val_loss: 0.8062 - val_acc: 0.8157
Epoch 2/7
0s - loss: 0.6709 - acc: 0.8466 - val_loss: 0.5969 - val_acc: 0.8572
Epoch 3/7
0s - loss: 0.5269 - acc: 0.8778 - val_loss: 0.5000 - val_acc: 0.8750
Epoch 4/7
1s - loss: 0.4545 - acc: 0.8897 - val_loss: 0.4575 - val_acc: 0.8875
Epoch 5/7
0s - loss: 0.4111 - acc: 0.8992 - val_loss: 0.4173 - val_acc: 0.8915
Epoch 6/7
0s - loss: 0.3734 - acc: 0.9071 - val_loss: 0.3874 - val_acc: 0.8988
Epoch 7/7
0s - loss: 0.3471 - acc: 0.9130 - val_loss: 0.3633 - val_acc: 0.9022
Model took  6.3559348583221436 seconds to complete


In [6]:
start = time.time()
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=32, nb_epoch=6, validation_split=0.2, verbose=2)
end = time.time()
print("Model took  {} seconds to complete".format(end - start))

Train on 16000 samples, validate on 4000 samples
Epoch 1/6
0s - loss: 1.0527 - acc: 0.7224 - val_loss: 0.6616 - val_acc: 0.8365
Epoch 2/6
0s - loss: 0.5545 - acc: 0.8621 - val_loss: 0.5231 - val_acc: 0.8645
Epoch 3/6
0s - loss: 0.4575 - acc: 0.8848 - val_loss: 0.4338 - val_acc: 0.8838
Epoch 4/6
0s - loss: 0.3986 - acc: 0.8976 - val_loss: 0.3936 - val_acc: 0.8940
Epoch 5/6
0s - loss: 0.3635 - acc: 0.9054 - val_loss: 0.3775 - val_acc: 0.8990
Epoch 6/6
0s - loss: 0.3488 - acc: 0.9076 - val_loss: 0.3581 - val_acc: 0.9005
Model took  5.353495836257935 seconds to complete


In [7]:
start = time.time()
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=16, nb_epoch=6, validation_split=0.2, verbose=2)
end = time.time()
print("Model took  {} seconds to complete".format(end - start))

Train on 16000 samples, validate on 4000 samples
Epoch 1/6
1s - loss: 0.9557 - acc: 0.7491 - val_loss: 0.6549 - val_acc: 0.8355
Epoch 2/6
1s - loss: 0.5316 - acc: 0.8621 - val_loss: 0.4923 - val_acc: 0.8618
Epoch 3/6
1s - loss: 0.4494 - acc: 0.8814 - val_loss: 0.4298 - val_acc: 0.8835
Epoch 4/6
1s - loss: 0.4101 - acc: 0.8922 - val_loss: 0.3863 - val_acc: 0.8972
Epoch 5/6
1s - loss: 0.3795 - acc: 0.8991 - val_loss: 0.4022 - val_acc: 0.8922
Epoch 6/6
1s - loss: 0.3677 - acc: 0.8979 - val_loss: 0.3580 - val_acc: 0.8965
Model took  6.903819799423218 seconds to complete


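### ReLU + softmax
- Needs a low learning rate for the network to learn anything (0.001 here, compared with 0.01 for the sigmoid networks)
- Performs worse than a sigmoid hidden layer for shallow networks: with the same architecture and batch size, the ReLU network below reaches 0.685 validation accuracy after 10 epochs, compared with 0.919 for the sigmoid network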

In [8]:
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.001)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_x_short, train_y_short, batch_size=32, nb_epoch=10, validation_split=0.2, verbose=2)

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
0s - loss: 10.8614 - acc: 0.3217 - val_loss: 10.5951 - val_acc: 0.3382
Epoch 2/10
0s - loss: 9.7185 - acc: 0.3927 - val_loss: 9.5325 - val_acc: 0.4035
Epoch 3/10
0s - loss: 8.8254 - acc: 0.4496 - val_loss: 8.8532 - val_acc: 0.4465
Epoch 4/10
0s - loss: 8.7444 - acc: 0.4550 - val_loss: 8.8393 - val_acc: 0.4495
Epoch 5/10
0s - loss: 8.0795 - acc: 0.4961 - val_loss: 7.4797 - val_acc: 0.5323
Epoch 6/10
0s - loss: 7.2350 - acc: 0.5482 - val_loss: 7.4038 - val_acc: 0.5373
Epoch 7/10
0s - loss: 6.7769 - acc: 0.5744 - val_loss: 6.2869 - val_acc: 0.6025
Epoch 8/10
0s - loss: 5.8232 - acc: 0.6324 - val_loss: 6.0675 - val_acc: 0.6155
Epoch 9/10
0s - loss: 5.3256 - acc: 0.6641 - val_loss: 5.6016 - val_acc: 0.6480
Epoch 10/10
0s - loss: 4.8013 - acc: 0.6951 - val_loss: 4.9422 - val_acc: 0.6850


<keras.callbacks.History at 0x118bd9e48>
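
The loss values in the ReLU run above are much larger than in the sigmoid runs. One possible reading, offered here as a rough consistency check rather than an established cause, is that the softmax outputs are saturated, so each wrong prediction costs close to the maximum possible loss. The sketch assumes predicted probabilities are clipped to `[EPS, 1 - EPS]` with `EPS = 1e-7` before the log is taken; both the constant and the helper `saturated_loss_estimate` are assumptions for illustration.

```python
import math

# Assumption: probabilities are clipped to [EPS, 1 - EPS] before taking the log
EPS = 1e-7

def saturated_loss_estimate(accuracy, eps=EPS):
    """Mean cross-entropy if every prediction were (almost) one-hot.

    Correct predictions cost about 0 and wrong ones cost about -log(eps).
    """
    wrong_cost = -math.log(eps)
    right_cost = -math.log(1.0 - eps)
    return accuracy * right_cost + (1.0 - accuracy) * wrong_cost

# (training accuracy, reported training loss) from epochs 1 and 10 of the ReLU run
reported = [(0.3217, 10.8614), (0.6951, 4.8013)]

for acc, loss in reported:
    print("acc={:.4f}  reported loss={:.4f}  saturated estimate={:.4f}".format(
        acc, loss, saturated_loss_estimate(acc)))
```

The estimates land close to the reported losses, which is consistent with saturated outputs but does not prove it. The notebook does not show how the pixel values were preprocessed, so whether input scaling plays a role is left open.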

In [9]:
model2 = Sequential()
model2.add(Dense(128, input_dim=784))
model2.add(Activation('sigmoid'))
model2.add(Dense(10))
model2.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model2.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model2.fit(train_x_short, train_y_short, batch_size=32, nb_epoch=10, validation_split=0.2, verbose=2)

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
0s - loss: 1.0783 - acc: 0.7277 - val_loss: 0.7008 - val_acc: 0.8327
Epoch 2/10
0s - loss: 0.5685 - acc: 0.8558 - val_loss: 0.5170 - val_acc: 0.8698
Epoch 3/10
0s - loss: 0.4600 - acc: 0.8815 - val_loss: 0.4459 - val_acc: 0.8838
Epoch 4/10
0s - loss: 0.3988 - acc: 0.8950 - val_loss: 0.4047 - val_acc: 0.8962
Epoch 5/10
0s - loss: 0.3652 - acc: 0.9046 - val_loss: 0.3760 - val_acc: 0.9058
Epoch 6/10
0s - loss: 0.3436 - acc: 0.9065 - val_loss: 0.3485 - val_acc: 0.9067
Epoch 7/10
0s - loss: 0.3132 - acc: 0.9136 - val_loss: 0.3331 - val_acc: 0.9117
Epoch 8/10
0s - loss: 0.3031 - acc: 0.9187 - val_loss: 0.3267 - val_acc: 0.9107
Epoch 9/10
0s - loss: 0.2942 - acc: 0.9194 - val_loss: 0.3043 - val_acc: 0.9165
Epoch 10/10
0s - loss: 0.2837 - acc: 0.9251 - val_loss: 0.3081 - val_acc: 0.9187


<keras.callbacks.History at 0x117a75198>

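### ReLU really comes into its own for deep networks
- Deeper networks tend to perform better than shallow networks for complex tasks
- But they are hard to train. ReLUs make it easier for deep networks to learn because their gradients don't saturate for positive inputs. In the runs below, the deep ReLU network reaches 0.937 validation accuracy after 10 epochs, while the same architecture with sigmoid activations reaches only 0.566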
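
The saturation argument can be illustrated by multiplying activation derivatives across four hidden layers, the depth used in the networks below. This is a simplified sketch: it ignores the weight matrices and tracks a single path through the network, and the pre-activation values are made-up examples.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25 (reached at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_grad(grad_fn, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

# Hypothetical pre-activations at the four hidden layers
best_case = [0.0, 0.0, 0.0, 0.0]   # where the sigmoid derivative is largest
positive = [2.0, 2.0, 2.0, 2.0]    # moderately positive inputs

print("sigmoid, best case:", chained_grad(sigmoid_grad, best_case))
print("sigmoid, z = 2:    ", chained_grad(sigmoid_grad, positive))
print("relu, z = 2:       ", chained_grad(relu_grad, positive))
```

Even in the best case the sigmoid factor shrinks by at least 4x per layer, while ReLU passes the gradient through unchanged for positive inputs. The trade-off is that ReLU units with negative pre-activations pass no gradient at all.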

In [10]:
model3 = Sequential()
model3.add(Dense(512, input_dim=784))
model3.add(Activation('relu'))
model3.add(Dense(256))
model3.add(Activation('relu'))
model3.add(Dense(128))
model3.add(Activation('relu'))
model3.add(Dense(64))
model3.add(Activation('relu'))
model3.add(Dense(10))
model3.add(Activation('softmax'))

sgd = SGD(lr=0.001)
model3.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(train_x_short, train_y_short, batch_size=32, nb_epoch=10, validation_split=0.2, verbose=2)

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
1s - loss: 8.2822 - acc: 0.4715 - val_loss: 7.4159 - val_acc: 0.5308
Epoch 2/10
2s - loss: 6.0692 - acc: 0.6049 - val_loss: 4.7384 - val_acc: 0.6823
Epoch 3/10
2s - loss: 3.8278 - acc: 0.7366 - val_loss: 2.1670 - val_acc: 0.8093
Epoch 4/10
1s - loss: 0.9286 - acc: 0.9003 - val_loss: 0.9505 - val_acc: 0.8902
Epoch 5/10
2s - loss: 0.4477 - acc: 0.9386 - val_loss: 0.5667 - val_acc: 0.9120
Epoch 6/10
1s - loss: 0.2416 - acc: 0.9581 - val_loss: 0.4728 - val_acc: 0.9240
Epoch 7/10
1s - loss: 0.1488 - acc: 0.9739 - val_loss: 0.4283 - val_acc: 0.9283
Epoch 8/10
1s - loss: 0.1016 - acc: 0.9823 - val_loss: 0.4186 - val_acc: 0.9317
Epoch 9/10
1s - loss: 0.0720 - acc: 0.9909 - val_loss: 0.3961 - val_acc: 0.9325
Epoch 10/10
1s - loss: 0.0568 - acc: 0.9945 - val_loss: 0.3994 - val_acc: 0.9367


<keras.callbacks.History at 0x113d3beb8>

In [11]:
model3 = Sequential()
model3.add(Dense(512, input_dim=784))
model3.add(Activation('sigmoid'))
model3.add(Dense(256))
model3.add(Activation('sigmoid'))
model3.add(Dense(128))
model3.add(Activation('sigmoid'))
model3.add(Dense(64))
model3.add(Activation('sigmoid'))
model3.add(Dense(10))
model3.add(Activation('softmax'))

sgd = SGD(lr=0.01)
model3.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(train_x_short, train_y_short, batch_size=32, nb_epoch=10, validation_split=0.2, verbose=2)

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
3s - loss: 2.3056 - acc: 0.1231 - val_loss: 2.2900 - val_acc: 0.1182
Epoch 2/10
5s - loss: 2.2853 - acc: 0.1509 - val_loss: 2.2806 - val_acc: 0.1182
Epoch 3/10
4s - loss: 2.2729 - acc: 0.1845 - val_loss: 2.2647 - val_acc: 0.2125
Epoch 4/10
3s - loss: 2.2556 - acc: 0.2496 - val_loss: 2.2444 - val_acc: 0.3855
Epoch 5/10
4s - loss: 2.2294 - acc: 0.3211 - val_loss: 2.2120 - val_acc: 0.3033
Epoch 6/10
4s - loss: 2.1866 - acc: 0.3882 - val_loss: 2.1569 - val_acc: 0.4338
Epoch 7/10
3s - loss: 2.1122 - acc: 0.4689 - val_loss: 2.0596 - val_acc: 0.4693
Epoch 8/10
3s - loss: 1.9872 - acc: 0.4995 - val_loss: 1.9100 - val_acc: 0.5230
Epoch 9/10
3s - loss: 1.8193 - acc: 0.5250 - val_loss: 1.7310 - val_acc: 0.5182
Epoch 10/10
3s - loss: 1.6228 - acc: 0.5627 - val_loss: 1.5260 - val_acc: 0.5657


<keras.callbacks.History at 0x118234a90>