# Cats vs Dogs Solve
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition

Using a VGG16 model to solve the Kaggle competition above.

Usage Notes:
* Cells should be converted to raw if they don't need to be run

Things to work on:
* Loop over training and learning rate array
* Data augmentation
* Save VGG16 architecture to file and just load that shit!

References
* https://github.com/fastai/courses/blob/master/deeplearning1/nbs/dogscats-ensemble.ipynb
* https://github.com/fastai/courses/blob/master/deeplearning1/nbs/dogs_cats_redux.ipynb


Process
1. Start with a VGG16 model
1. Train the last layer, precomputing the outputs of the rest
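
The second step works because the earlier layers stay fixed: their outputs can be computed once, cached, and reused on every epoch while only the final layer is trained. The toy sketch below illustrates that idea in plain Python. `frozen_features`, `train_head` and the four-point dataset are hypothetical stand-ins, not the notebook's VGG16 code.

```python
import math

calls = {"frozen": 0}  # counts how often the "frozen layers" are evaluated


def frozen_features(x):
    """Hypothetical stand-in for the fixed (non-trainable) layers."""
    calls["frozen"] += 1
    return [x, abs(x)]


def train_head(features, labels, epochs=50, lr=0.5):
    """Fit a logistic-regression 'last layer' on cached features with plain SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


inputs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

# One pass through the frozen part; the result is reused for every epoch.
cached = [frozen_features(x) for x in inputs]
w, b = train_head(cached, labels)

print(calls["frozen"])  # one evaluation per input, independent of `epochs`
```

The frozen function runs once per input no matter how many epochs the head is trained for, which is the saving the precomputed feature files in this notebook are meant to provide.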

## Initialisation

In [1]:
import numpy as np
import pandas as pd
import os.path

from vgg16 import VGG_16
from utils import *


from keras.models import load_model

#Jupyter Specific
%matplotlib inline
from IPython.display import display

Using Theano backend.
ERROR (theano.gpuarray): pygpu was configured but could not be imported or is too old (version 0.6 or higher required)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/site-packages/theano/gpuarray/__init__.py", line 21, in <module>
    import pygpu
ImportError: No module named pygpu


In [2]:
# Parameters
dir_case = './data/dogscats/'
dir_data = dir_case + 'population/'
ensambles_nb = 1

# Sane Defaults, but feel free to change
dir_model =  dir_case + 'model/'
dir_submissions =  dir_case + 'submissions/'
fname_submission =  'Kaggle_CatsDogs'
fname_stats = dir_case + 'stats.csv'
batch_size = 24

# Constants
IMAGE_SIZE = (224, 224)

## Data Setup

In [3]:
generators = {}

generators['train'] = image.ImageDataGenerator().flow_from_directory(dir_data+'train', 
                                                                     batch_size=batch_size,
                                                                     shuffle=False,
                                                                     class_mode='categorical',
                                                                     target_size=IMAGE_SIZE)

generators['valid'] = image.ImageDataGenerator().flow_from_directory(dir_data+'valid', 
                                                                     batch_size=batch_size,
                                                                     shuffle=False,
                                                                     class_mode='categorical',
                                                                     target_size=IMAGE_SIZE)

generators['test']  = image.ImageDataGenerator().flow_from_directory(dir_data+'test', 
                                                                     batch_size=batch_size,
                                                                     shuffle=False,
                                                                     class_mode=None,
                                                                     target_size=IMAGE_SIZE)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.


## Looking at Data Augmentation


Keras ImageDataGenerator Arguments:

* featurewise_center: Boolean. Set input mean to 0 over the dataset, feature-wise.
* samplewise_center: Boolean. Set each sample mean to 0.
* featurewise_std_normalization: Boolean. Divide inputs by std of the dataset, feature-wise.
* samplewise_std_normalization: Boolean. Divide each input by its std.
* zca_epsilon: epsilon for ZCA whitening. Default is 1e-6.
* zca_whitening: Boolean. Apply ZCA whitening.
* rotation_range: Int. Degree range for random rotations.
* width_shift_range: Float (fraction of total width). Range for random horizontal shifts.
* height_shift_range: Float (fraction of total height). Range for random vertical shifts.
* shear_range: Float. Shear Intensity (Shear angle in counter-clockwise direction as radians)
* zoom_range: Float or [lower, upper]. Range for random zoom. If a float, [lower, upper] = [1-zoom_range, 1+zoom_range].
* channel_shift_range: Float. Range for random channel shifts.
* fill_mode: One of {"constant", "nearest", "reflect" or "wrap"}. Points outside the boundaries of the input are filled according to the given mode:
        "constant": kkkkkkkk|abcd|kkkkkkkk (cval=k)
        "nearest": aaaaaaaa|abcd|dddddddd
        "reflect": abcddcba|abcd|dcbaabcd
        "wrap": abcdabcd|abcd|abcdabcd
* cval: Float or Int. Value used for points outside the boundaries when fill_mode = "constant".
* horizontal_flip: Boolean. Randomly flip inputs horizontally.
* vertical_flip: Boolean. Randomly flip inputs vertically.
* rescale: rescaling factor. Defaults to None. If None or 0, no rescaling is applied, otherwise we multiply the data by the value provided (before applying any other transformation).
* preprocessing_function: function that will be applied to each input. The function will run before any other modification on it. The function should take one argument: one image (Numpy tensor with rank 3), and should output a Numpy tensor with the same shape.
* data_format: One of {"channels_first", "channels_last"}. "channels_last" mode means that the images should have shape (samples, height, width, channels), "channels_first" mode means that the images should have shape (samples, channels, height, width). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last".
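
The `fill_mode` diagrams above can be reproduced on a short string. This is a plain-Python illustration of the four padding rules only, using a hypothetical `pad_1d` helper; it is not how Keras implements them internally.

```python
def pad_1d(values, pad, mode="nearest", cval="k"):
    """Extend `values` by `pad` items on each side using the named fill mode."""
    n = len(values)

    def pick(i):
        if 0 <= i < n:
            return values[i]  # inside the original input
        if mode == "constant":
            return cval
        if mode == "nearest":
            return values[0] if i < 0 else values[-1]
        if mode == "wrap":
            return values[i % n]
        if mode == "reflect":
            j = i % (2 * n)  # position within one forward-plus-mirrored period
            return values[j] if j < n else values[2 * n - 1 - j]
        raise ValueError("unknown fill mode: %s" % mode)

    return "".join(pick(i) for i in range(-pad, n + pad))


for mode in ("constant", "nearest", "reflect", "wrap"):
    padded = pad_1d("abcd", 8, mode)
    # Insert the same | separators used in the argument list above
    print("%-8s %s|%s|%s" % (mode, padded[:8], padded[8:12], padded[12:]))
```

For image augmentation the same rules apply along both image axes, filling in the pixels that a shift, rotation or zoom leaves empty.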

In [None]:
# Define an augmented data generator
datagen = image.ImageDataGenerator(
                                    rotation_range=10,
                                    width_shift_range=0.1,
                                    height_shift_range=0.1,
                                    zoom_range=0.1,
                                    channel_shift_range=10,
                                    shear_range=0.05,
                                    horizontal_flip=True,
                                    dim_ordering='tf'
                                  )


In [None]:
# Test augmentation on one image
img = np.expand_dims(ndimage.imread(dir_data+'test/unknown/00005.jpg'),0)
gen_aug = datagen.flow(img)

n = 8
imgs = [next(gen_aug)[0] for i in range(n)]  # draw n augmented variants of the image


plots(img)  # original image; plots() is a helper from utils.py
plots(imgs[:n//2])  # integer division keeps the slice index an int
plots(imgs[n//2:])


## Tracking State

We track data about the training process in a Pandas dataframe

In [None]:
### Load stats if found, otherwise create a blank one
if os.path.exists(fname_stats):
    stats = pd.read_csv(fname_stats)
else:
    stats = pd.DataFrame(columns=['model','epoch','learning_rate','acc','loss','val_acc','val_loss'])

    
display(stats.head())
    

In [None]:
latests_models = []
for m in range(0, ensambles_nb):
    ind = stats.query('model==@m')['epoch'].idxmax()
    latests_models.append(ind)

stats_sum = stats.iloc[latests_models]
display(stats_sum.head())
    

## Model Creation

In [None]:
VGG16 = VGG_16(generators, batch_size=batch_size)

In [None]:
VGG16.model.summary()

In [None]:
conv_layers, fc_layers = split_at(VGG16.model, Flatten)
conv_model = Sequential(conv_layers)
conv_model.summary()

### Precompute Features

In [None]:
features_trn = conv_model.predict_generator(generators['train'], generators['train'].nb_sample)
features_val = conv_model.predict_generator(generators['valid'], generators['valid'].nb_sample)


In [None]:
save_array(dir_model + 'features_trn_conv.bc', features_trn)
save_array(dir_model + 'features_val_conv.bc', features_val)

In [4]:


features_trn = load_array(dir_model + 'features_trn_conv.bc')
features_val = load_array(dir_model +  'features_val_conv.bc')



In [5]:
labels_trn = to_categorical(generators['train'].classes)
labels_val = to_categorical(generators['valid'].classes)
labels_val[:10]

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])

In [None]:
conv_output = features_trn.shape[1:]

In [None]:
def get_fc_layers(p, in_shape):
    return [
        Dense(4096, activation='relu', input_shape=in_shape),
        Dropout(p),
        Dense(4096, activation='relu'),
        Dropout(p),
        Dense(2, activation='softmax')
        ]

In [None]:
fc_model = Sequential(get_fc_layers(0.5, conv_output))
fc_model.summary()

In [None]:
# Transfer Weights from VGG16 model
for l1,l2 in zip(fc_model.layers, fc_layers): 
    l1.set_weights(l2.get_weights())



In [None]:
n = 5
display(fc_model.predict(features_val)[:n])
display(VGG16.predict_gen(generators['valid'])[:n])

In [None]:
labels_val.shape

In [None]:
fc_model.compile(optimizer=Adam(lr=0.01),loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
fc_model.save(dir_model+'fc_model.h5')

In [6]:
fc_model = load_model(dir_model+'fc_model.h5')

In [7]:
fc_model.fit(features_trn[:2000], labels_trn[:2000], nb_epoch = 1, batch_size=1,
                                 validation_data = (features_val, labels_val))

Train on 2000 samples, validate on 2000 samples
Epoch 1/1


MemoryError: Error allocating 411041792 bytes of device memory (out of memory).
Apply node that caused the error: GpuDot22Scalar(GpuDimShuffle{1,0}.0, GpuElemwise{Composite{((i0 * Composite{Switch(i0, (i1 * i2), i3)}(i1, i2, i3, i4)) + (i0 * Composite{Switch(i0, (i1 * i2), i3)}(i1, i2, i3, i4) * sgn(i5)))}}[(0, 2)].0, HostFromGpu.0)
Toposort index: 112
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), TensorType(float32, scalar)]
Inputs shapes: [(25088, 1), (1, 4096), ()]
Inputs strides: [(1, 0), (0, 1), ()]
Inputs values: ['not shown', 'not shown', array(0.10000002384185791, dtype=float32)]
Outputs clients: [[GpuElemwise{Composite{((i0 * i1) + i2)}}[(0, 1)](GpuDimShuffle{x,x}.0, <CudaNdarrayType(float32, matrix)>, GpuDot22Scalar.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

In [None]:
models = []
for m in range(0, ensambles_nb):
    models.append(VGG_16(generators, batch_size=batch_size))
    fname_model =  '{}weights_model{:02d}.h5'.format(dir_model, m)
    if os.path.exists(fname_model):
        models[m].model.load_weights(fname_model)
        print( 'Loaded Weights from: {}'.format(fname_model) )
    else:
        print('Created new weights: {}'.format(fname_model))
    

In [None]:
training = np.array([
#    [0.030, 2],
#    [0.010, 2],
    [0.001, 1],
])
m=0
for model in models:
    print('Training model {:02d}'.format(m))
    for step in training:
        model.lr = step[0]
        for epoch in range(0,int(step[1])):
            
            epochs_current = stats.query('model==@m')['epoch'].max() + 1
            
            # if we didn't find anything in the dataframe, it must be a new model
            if np.isnan(epochs_current):
                epochs_current = 0
            
            print('Training epoch {} at {}'.format(epochs_current, model.lr))

            # Train single epoch
            hist = model.fit_gen(nb_epoch=1)
            
            # Update stats
            stats_slug = {}
            
            # Convert results array to float
            for key in hist.history:
                hist.history[key] = float(hist.history[key][0])

            # Add learning parameters
            stats_slug.update({'model': m,
                              'epoch': epochs_current,
                              'learning_rate': model.lr})
            
            ## Add accuracy
            stats_slug.update(hist.history)

            stats = stats.append(stats_slug, ignore_index=True)
            
            #['model','epochs','learning_rate','loss', 'acc','loss_val','acc_val']
            
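            # Save to this model's own file; fname_model still holds the last name from the loading loop
            model.model.save_weights('{}weights_model{:02d}.h5'.format(dir_model, m))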
            
    # Go to next model
    m += 1
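
The loop above stores learning rates and epoch counts together in one float array, which is why the epoch count needs an `int(...)` cast. One alternative, sketched here with a hypothetical `expand_schedule` helper, keeps the schedule as (learning rate, epochs) tuples and flattens it to one learning rate per epoch.

```python
def expand_schedule(schedule):
    """Turn [(learning_rate, epochs), ...] into one learning rate per epoch."""
    rates = []
    for learning_rate, epochs in schedule:
        rates.extend([learning_rate] * epochs)
    return rates


# Same values as the (partly commented-out) training array above
schedule = [(0.030, 2), (0.010, 2), (0.001, 1)]
per_epoch = expand_schedule(schedule)

for epoch, learning_rate in enumerate(per_epoch):
    print("epoch %d at learning rate %s" % (epoch, learning_rate))
```

Each entry of `per_epoch` then maps to a single one-epoch training call, so the stats bookkeeping can run once per entry.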

In [None]:
stats.to_csv(fname_stats, index=False)
display(stats.head(10))

## Explore Trained model

In [None]:
classes = model.classes
classes

In [None]:
# Get predictions for validation set
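# pred_ensamble is defined under "Output Submission"; run that cell first.
# generators is a dict keyed by 'train', 'valid' and 'test', not a list.
preds_valid = pred_ensamble(models, generators['valid'])
fnames_valid = np.array(generators['valid'].filenames)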
#strip category folder
#fnames = np.array([f[f.find('/')+1:] for f in fnames])

display(preds_valid[:10])
display(fnames_valid[:10])

## Precomputing up to the last layer

In [None]:

model.model.pop()
model.compile()

In [None]:
preds = model.predict_gen(model.gen_valid)
display(preds[:5])

(imgs, labels) = next(model.gen_train)

n = 4
imgs = imgs[:n]
labels = labels[:n]
plots(imgs, titles=labels) # plots() is a helper from utils.py


print(model.predict(imgs))


In [None]:
# Convert predictions into a label
labels_pred = np.round(preds_valid[:,1]) # Column 1 is the probability that the image is a dog

labels_pred[:10]

In [None]:
# get labels
labels_actual = generators['valid'].classes
labels_actual[:10]


In [None]:
# plot confusion matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(labels_actual, labels_pred)

plot_confusion_matrix(cm, classes, normalize=True)
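
For reference, `sklearn.metrics.confusion_matrix` puts the actual labels on the rows and the predicted labels on the columns. The sketch below rebuilds that layout by hand on a made-up six-image example. The `confusion_counts` helper and the toy labels are hypothetical, and the row normalisation assumes `normalize=True` divides each row by its total.

```python
def confusion_counts(actual, predicted, n_classes=2):
    """counts[i][j] = number of samples with actual class i predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        counts[int(a)][int(p)] += 1
    return counts


# Toy labels: 0 = cat, 1 = dog
toy_actual = [0, 0, 0, 1, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 0]

toy_cm = confusion_counts(toy_actual, toy_predicted)

# Row-normalise so each row shows the fraction of that actual class
toy_cm_normalised = [[c / float(sum(row)) for c in row] for row in toy_cm]

print(toy_cm)
print(toy_cm_normalised)
```

The diagonal holds the correct predictions; the off-diagonal entry in the first row counts cats predicted as dogs, and the one in the second row counts dogs predicted as cats.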

In [None]:
from keras.preprocessing import image

valid_path = dir_data + 'valid/'
#Helper function to plot images by index in the validation set 
#Plots is a helper function in utils.py
def plots_idx(idx, titles=None):
    plots([image.load_img(valid_path + fnames_valid[i]) for i in idx], titles=titles)

n_view = 6

In [None]:
#1. A few correct predictions at random
correct = np.where(labels_actual==labels_pred)[0]
print("Found %d correct labels" % len(correct))
idx = permutation(correct)[:n_view]
plots_idx(idx, preds_valid[idx])

In [None]:
#2. A few incorrect predictions at random
incorrect = np.where(labels_actual!=labels_pred)[0]
print("Found %d incorrect labels" % len(incorrect))
idx = permutation(incorrect)[:n_view]
plots_idx(idx, preds_valid[idx])

In [None]:
#3a. The images we were most confident were cats, and are actually cats
correct_cats = np.where((labels_pred==0) & (labels_pred==labels_actual))[0]
print("Found %d confident correct cats labels" % len(correct_cats))
# Rank by the cat probability (column 0) so argsort returns a 1-D ordering
most_correct_cats = np.argsort(preds_valid[correct_cats, 0])[::-1][:n_view]
plots_idx(correct_cats[most_correct_cats], preds_valid[correct_cats][most_correct_cats])

In [None]:
len(models)

## Output Submission

Format Kaggle requires for submissions:
```
    imageId,isDog
    1242, .3984
    3947, .1000
    4539, .9082
    2345, .0000
```

In [None]:
def pred_ensamble(models, generator):
    pred_test = 0
    
    for model in models:
        pred_test += model.predict_gen(generator)
        
    pred_test /= len(models)
    
    return pred_test

In [None]:
# Run model on test data and get predictions
pred_test = pred_ensamble(models, generators['test'])


display(pred_test[:5])


In [None]:
classes = model.classes
display(classes)

In [None]:
# Grab dog predictions
isDog = pred_test[:,1]

#Get imageids, then strip category folder and extension
imageId = np.array(model.gen_test.filenames)
imageId = np.array([f[f.find('/')+1:] for f in imageId]) #strip category folder
imageId = np.array([f[:f.find('.')] for f in imageId]) #strip extension

display(isDog[:5])
display(imageId[:5])
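
The slicing above assumes each filename has exactly one folder prefix and no dot before the extension. A slightly more defensive variant, sketched here with a hypothetical `image_id` helper, uses `os.path` from the standard library instead.

```python
import os.path


def image_id(filename):
    """Drop the folder and the final extension: 'unknown/00005.jpg' -> '00005'."""
    base = os.path.basename(filename)   # strip category folder
    return os.path.splitext(base)[0]    # strip only the last extension


print(image_id('unknown/00005.jpg'))
print(image_id('unknown/cat.12.jpg'))  # a dot inside the name is preserved
```

For the test filenames shown in this notebook both approaches give the same IDs; they differ only when a name contains an extra dot or nested folders.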

### Kaggle Evaluation

Kaggle scores submissions with the binary log loss, defined as:

$$\textrm{LogLoss} = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$
- $n$ is the number of images in the test set
- $\hat{y}_i$  is the predicted probability of the image being a dog
- $y_i$ is 1 if the image is a dog, 0 if cat
- $\log()$ is the natural (base e) logarithm

As shown in the plot below, there is an "infinite" penalty for predicting the wrong label with full confidence, i.e. predicting 0 when it should be 1. A trick to improve the Kaggle score is to clip the most confident predictions away from 0 and 1.

The clipping amount is chosen arbitrarily.
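
To make the trade-off concrete, the sketch below evaluates the formula above for a single dog image ($y_i = 1$) with and without clipping. The `log_loss` and `clip` helpers and the example probabilities are illustrative assumptions; only the 0.05 clipping amount comes from this notebook.

```python
import math


def log_loss(y_true, y_pred):
    """Log loss of one prediction; y_pred must lie strictly between 0 and 1."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


def clip(p, amount):
    """Keep a probability inside [amount, 1 - amount]."""
    return min(max(p, amount), 1 - amount)


clipping = 0.05

# A dog predicted as "almost certainly a cat"
wrong_unclipped = log_loss(1, 1e-7)
wrong_clipped = log_loss(1, clip(1e-7, clipping))

# A dog predicted as "almost certainly a dog"
right_unclipped = log_loss(1, 1 - 1e-7)
right_clipped = log_loss(1, clip(1 - 1e-7, clipping))

print("wrong: %.3f unclipped, %.3f clipped" % (wrong_unclipped, wrong_clipped))
print("right: %.6f unclipped, %.6f clipped" % (right_unclipped, right_clipped))
```

Clipping caps the cost of a confident mistake at $-\log(0.05) \approx 3.0$, in exchange for a small fixed cost of $-\log(0.95) \approx 0.05$ on every confident correct answer.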

In [None]:
# Let's plot the log loss for the case that the image is a dog, i.e. y_i = 1
from sympy import symbols, log
from sympy import plot

y_true = 1          # actual label: dog
y = symbols('y')    # predicted probability of dog
loss = - ( y_true*log(y) + (1-y_true)*log(1-y) )

# Start just above 0, where the loss diverges
plot(loss, (y, 0.001, 1), xlabel='Prediction', ylabel='Log Loss');

In [None]:
clipping = 0.05

isDog = isDog.clip(min=clipping, max=1-clipping)
display(isDog[:5])

### Create Submission

In [None]:
# Compile results into a Pandas Dataframe
subm = pd.DataFrame() 
subm.insert(0,"imageId",imageId) # insert id to the first column
subm.insert(1,"isDog",isDog) # insert predictions
display(subm.head(5))

In [None]:
from datetime import datetime


fname_submission_timestapped = '%s_%s.csv' % ( dir_submissions+fname_submission, datetime.now().strftime('%Y%m%d_%H%M%S'))

display(fname_submission_timestapped)

subm.to_csv(fname_submission_timestapped, index=False)