# Lesson 1 Homework: Cats vs. Dogs
The goal of this assignment is to finish in the top 50% of the Dogs vs. Cats competition.
According to the leaderboard at https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/leaderboard,
that takes a log loss of roughly 0.12.
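
To get a feel for what a log loss of 0.12 means, here is a minimal NumPy sketch of the binary log loss. The function name `log_loss` and the four example predictions are made up for illustration; they are not data from the competition.

```python
import numpy as np

def log_loss(y_true, p_dog, eps=1e-15):
    """Mean binary cross-entropy; y_true is 1 for dog and 0 for cat."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) can never occur
    p = np.clip(np.asarray(p_dog, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Probability on the true class that gives a loss of 0.12 on every image
p_equiv = float(np.exp(-0.12))  # roughly 0.887

# Four hypothetical predictions: three confident and right, one confident and wrong
y_true = [1, 0, 1, 0]
p_dog = [0.95, 0.05, 0.95, 0.90]
loss = log_loss(y_true, p_dog)
print(p_equiv, loss)
```

The single confident mistake dominates the average, so a low log loss needs both high accuracy and well-calibrated probabilities.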

My first idea is to use the pretrained VGG16 from Keras and train a few fully connected layers on top.

In [22]:
2/np.sqrt(2000)

0.044721359549995794

In [10]:
# following https://keras.io/applications/#vgg16
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.layers import Dense, Flatten
from keras.models import Model

import numpy as np

In [60]:
# pooling converts last layer into 2D tensor
base_model = VGG16(weights='imagenet', include_top=False, pooling="avg")
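
As a sanity check on what `pooling="avg"` does to the shapes, here is a small NumPy sketch. The random array is a stand-in for the final convolutional feature map (7x7x512 for a 224x224 input); it is not an actual VGG16 output.

```python
import numpy as np

# Stand-in feature map for a batch of 2 images: (batch, height, width, channels)
features = np.random.RandomState(0).rand(2, 7, 7, 512)

# Global average pooling: average over the two spatial axes
pooled = features.mean(axis=(1, 2))

# Flattening keeps every spatial position instead
flat = features.reshape(features.shape[0], -1)

print(pooled.shape)  # (2, 512)
print(flat.shape)    # (2, 25088)
```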

In [61]:
base_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, None, None, 3)     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, None, None, 64)    1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, None, None, 64)    36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, None, None, 64)    0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, None, None, 128)   73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, None, None, 128)   147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, None, None, 128)   0         
__________

Add layers

In [62]:
x = base_model.output
# Fully connected layer
x = Dense(1024, activation='relu')(x)
# final outputs
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
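
For reference, this is roughly what the two new layers compute on the pooled features, written out in plain NumPy. The weights here are random placeholders and the helper names `relu` and `softmax` are my own; Keras handles all of this internally.

```python
import numpy as np

rng = np.random.RandomState(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stand-in for the pooled VGG16 output: 4 images, 512 features each
pooled = rng.rand(4, 512)

# Randomly initialised weights playing the role of the two Dense layers
w1, b1 = rng.randn(512, 1024) * 0.01, np.zeros(1024)
w2, b2 = rng.randn(1024, 2) * 0.01, np.zeros(2)

hidden = relu(pooled @ w1 + b1)    # Dense(1024, activation='relu')
probs = softmax(hidden @ w2 + b2)  # Dense(2, activation='softmax')
print(probs.shape)                 # (4, 2)
```

Each row of `probs` sums to 1, giving one probability per class.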

Training setup

In [63]:
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all VGG layers
for layer in base_model.layers:
    layer.trainable = False

In [64]:
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=["accuracy"])

In [65]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, None, None, 3)     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, None, None, 64)    1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, None, None, 64)    36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, None, None, 64)    0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, None, None, 128)   73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, None, None, 128)   147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, None, None, 128)   0         
__________

### Fit

In [2]:
batch_size=32 # 128 is too much for GTX 1060 6GB

In [3]:
def preprocess_tf(x):
    return preprocess_input(x, mode="tf")

In [4]:
# Need to preprocess input here?
def get_batches(dirname, gen=image.ImageDataGenerator(preprocessing_function=preprocess_tf), shuffle=True, 
                batch_size=batch_size, class_mode='categorical'):
    return gen.flow_from_directory(dirname, target_size=(224,224), 
                class_mode=class_mode, shuffle=shuffle, batch_size=batch_size)

In [5]:
path = "data/dogscats/"
# For testing code, not enough data for anything serious
#path = "data/dogscats/sample/"

In [6]:
train_batches = get_batches(path + 'train', batch_size=batch_size)
val_batches = get_batches(path + 'valid', batch_size=batch_size)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.


In [12]:
imgs, labels = next(train_batches)

In [13]:
imgs.shape

(64, 224, 224, 3)

In [14]:
labels.shape

(64, 2)

In [15]:
len(train_batches)

360

In [16]:
len(val_batches)

32

In [17]:
23000/64

359.375

In [18]:
2000/64

31.25

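`len()` of a batch generator is the number of batches in one pass over the full dataset, i.e. the sample count divided by the batch size, rounded up. (The shapes and counts above come from a run with a batch size of 64, not the 32 set in `In [2]`.)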
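
The counts above are consistent with a ceiling division. A quick check, using a helper name (`n_batches`) that is my own and not part of Keras:

```python
import math

def n_batches(n_samples, batch_size):
    """Batches needed for one pass over the data; the last one may be smaller."""
    return math.ceil(n_samples / batch_size)

train_steps = n_batches(23000, 64)
val_steps = n_batches(2000, 64)

# Size of the final, partial training batch
last_train_batch = 23000 - (train_steps - 1) * 64

print(train_steps, val_steps, last_train_batch)  # 360 32 24
```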

In [21]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x1ab04082fd0>

### Try different preprocessing:
For each option, I created a new model object and trained one epoch.

With `preprocessing_function=preprocess_input`:

In [32]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x1ac7d253dd8>

In [41]:
def preprocess_tf(x):
    return preprocess_input(x, mode="tf")

With `preprocessing_function=preprocess_tf`:

In [45]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x1ac7d3fa588>

With no `preprocessing_function` specified:

In [56]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x1ac7f061898>

Of the three options, `preprocessing_function=preprocess_tf` is the right choice, so I use it from here on.
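
To make the difference between the options concrete, here is a NumPy sketch of what I understand the two modes of `preprocess_input` to do: `mode="tf"` rescales pixels to [-1, 1], while the default `mode="caffe"` converts RGB to BGR and subtracts the ImageNet channel means. The function names below are my own re-implementations for illustration, not Keras functions.

```python
import numpy as np

def preprocess_tf_like(x):
    """Scale pixels from [0, 255] to [-1, 1], as mode="tf" does."""
    return x / 127.5 - 1.0

def preprocess_caffe_like(x):
    """RGB -> BGR, then subtract the ImageNet per-channel means (mode="caffe")."""
    bgr = x[..., ::-1]
    return bgr - np.array([103.939, 116.779, 123.68])

pixels = np.array([[[0.0, 127.5, 255.0]]])  # one RGB pixel, shape (1, 1, 3)
tf_out = preprocess_tf_like(pixels)
caffe_out = preprocess_caffe_like(pixels)
print(tf_out)     # [[[-1.  0.  1.]]]
print(caffe_out)
```

With no preprocessing function at all, the generator passes raw values in [0, 255] to the network.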

### What's the asymptotic performance of this model?
That is, how well does it do if trained until it stops improving?

In [70]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=5, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
  7/360 [..............................] - ETA: 3:03 - loss: 0.0926 - acc: 0.9621

KeyboardInterrupt: 
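
I stopped the run above by hand. A patience rule is one way to automate "train until it stops improving"; the sketch below shows the logic in plain Python. The function `stop_epoch` and the loss values are hypothetical, not results from this notebook. Keras offers an `EarlyStopping` callback (with `monitor` and `patience` arguments) that can be passed to `fit_generator` via `callbacks=` for the same purpose.

```python
def stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the counter
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch
history = [0.20, 0.15, 0.13, 0.14, 0.135, 0.12]
print(stop_epoch(history, patience=2))  # 5
print(stop_epoch(history, patience=3))  # None
```

The example also shows the trade-off: a patience of 2 stops just before the improvement in epoch 6.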

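Nominally, this satisfies the homework: the training loss early in epoch 4 (about 0.09) is below the 0.12 target.

The training and validation losses have converged, so to improve further I need a higher-variance (more flexible) model.
Maybe try the architecture of VGG's own top layers?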

### Try VGG's top layer architecture

In [72]:
VGG16(weights=None, include_top=True).summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [7]:
# pooling converts last layer into 2D tensor
base_model = VGG16(weights='imagenet', include_top=False, pooling="avg")
x = base_model.output
# 2x Fully connected layer
x = Dense(1024, activation='relu')(x)
x = Dense(1024, activation='relu')(x)
# final outputs
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all VGG layers
for layer in base_model.layers:
    layer.trainable = False
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=["accuracy"])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, None, None, 3)     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, None, None, 64)    1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, None, None, 64)    36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, None, None, 64)    0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, None, None, 128)   73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, None, None, 128)   147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, None, None, 128)   0         
__________

In [8]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x246da7ae940>

In [9]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=2, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x246dab896d8>

This is probably worse than the single 1024-unit layer.

### Flatten the final pre-trained layer instead of doing average pooling
Then use a smaller dense layer to keep the number of parameters reasonable.
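
The arithmetic behind "reasonable": flattening a 7x7x512 feature map gives 25088 inputs instead of 512, so the first dense layer grows by a factor of 49. A quick count, with a helper name (`dense_params`) of my own:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

pooled_features = 512        # after global average pooling
flat_features = 7 * 7 * 512  # after Flatten on a 7x7x512 feature map

head_pooled_1024 = dense_params(pooled_features, 1024)
head_flat_1024 = dense_params(flat_features, 1024)
head_flat_128 = dense_params(flat_features, 128)

print(head_pooled_1024)  # 525312
print(head_flat_1024)    # 25691136
print(head_flat_128)     # 3211392
```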

In [15]:
# no pooling here: the conv output stays 4D and is flattened below
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = base_model.output
# Flatten
x = Flatten()(x)
# 2x Fully connected layer
x = Dense(128, activation='relu')(x)
x = Dense(1024, activation='relu')(x)
# final outputs
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all VGG layers
for layer in base_model.layers:
    layer.trainable = False
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=["accuracy"])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [16]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x247fb69edd8>

In [18]:
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = base_model.output
# Flatten
x = Flatten()(x)
# Fully connected layer
x = Dense(64, activation='relu')(x)
# final outputs
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all VGG layers
for layer in base_model.layers:
    layer.trainable = False
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=["accuracy"])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [19]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x247fcc40c50>

In [20]:
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = base_model.output
# Flatten
x = Flatten()(x)
# final outputs
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all VGG layers
for layer in base_model.layers:
    layer.trainable = False
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=["accuracy"])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [21]:
model.fit_generator(train_batches, steps_per_epoch=len(train_batches), epochs=1, 
                    validation_data=val_batches, validation_steps=len(val_batches))

Epoch 1/1


<keras.callbacks.History at 0x248010d4b38>

According to https://arxiv.org/abs/1409.1556, the VGG nonlinearity is ReLU.

### More ideas
* Retrain the top convolutional layer
* Change optimizers
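
For the first idea, the freezing loop used above could be adapted to leave only the top convolutional layer trainable. The sketch below uses stand-in objects with just `.name` and `.trainable` attributes so it runs without Keras; the helper `unfreeze_by_prefix` is hypothetical, and the layer names follow the `blockN_convM` pattern seen in the summaries. With a real model, it would need to be recompiled after changing `trainable`.

```python
from types import SimpleNamespace

def unfreeze_by_prefix(layers, prefix):
    """Make layers whose name starts with prefix trainable and freeze the rest."""
    changed = []
    for layer in layers:
        layer.trainable = layer.name.startswith(prefix)
        if layer.trainable:
            changed.append(layer.name)
    return changed

# Stand-ins for Keras layers: only .name and .trainable are used
names = ["block4_conv3", "block4_pool", "block5_conv1",
         "block5_conv2", "block5_conv3", "block5_pool"]
layers = [SimpleNamespace(name=n, trainable=False) for n in names]

unfrozen = unfreeze_by_prefix(layers, "block5_conv3")
print(unfrozen)  # ['block5_conv3']
```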