# Miniproject 1: Image Classification

## Introduction

### Description

One of the deepest traditions in learning about deep learning is to first [tackle the exciting problem of MNIST classification](http://deeplearning.net/tutorial/logreg.html). [The MNIST database](https://en.wikipedia.org/wiki/MNIST_database) (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that was [recently extended](https://arxiv.org/abs/1702.05373). We break with this tradition (just a little bit) and first tackle the related problem of classifying cropped, downsampled and grayscaled images of house numbers in [The Street View House Numbers (SVHN) Dataset](http://ufldl.stanford.edu/housenumbers/).


### Prerequisites

- You should have a running installation of [tensorflow](https://www.tensorflow.org/install/) and [keras](https://keras.io/).
- You should know the concepts "multilayer perceptron", "stochastic gradient descent with minibatches", "training and validation data", "overfitting" and "early stopping".

### What you will learn

- You will learn how to define feedforward neural networks in keras and fit them to data.
- You will be guided through a prototyping procedure for the application of deep learning to a specific domain.
- You will encounter concepts discussed later in the lecture, like "regularization", "batch normalization" and "convolutional networks".
- You will gain some experience with the influence of network architecture, optimizer and regularization choices on the goodness of fit.
- You will learn to be more patient :) Some fits may take your computer quite a bit of time; run them overnight.

### Evaluation criteria

The evaluation is (mostly) based on the figures you submit and your answer sentences. 
We will only do random tests of your code and not re-run the full notebook.

### Your names

Before you start, please enter your full name(s) in the field below; they are used to load the data. The variable student2 may remain empty if you work alone.

In [50]:
student1 = "Nicolas Lesimple"
student2 = "Nakka Krishna"

## Some helper functions

For your convenience we provide here some functions to preprocess the data and plot the results later. Simply run the following cells with `Shift-Enter`.

### Dependencies and constants

In [51]:
import numpy as np
import time
import matplotlib.pyplot as plt
import scipy.io

import keras
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Dropout, Flatten, Activation, Input
from keras.optimizers import SGD, Adam
from keras.layers.normalization import BatchNormalization
from keras.models import Model
from keras import regularizers



# you may experiment with different subsets, 
# but make sure in the submission 
# it is generated with the correct random seed for all exercises.
# np.random.seed(hash(student1 + student2) % 2**32)
np.random.seed(237699 + 279617)
subset_of_classes = np.random.choice(range(10), 5, replace = False)
print("Subset of classes selected: {}".format(', '.join([str(subset_of_classes[i]) for i in range(5)])))

Subset of classes selected: 1, 0, 5, 6, 3
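
A note on the commented-out `hash(student1 + student2)` seed: in Python 3, string hashes are salted per process unless `PYTHONHASHSEED` is fixed, so that expression would generally give a different seed in every session. If a name-derived seed is wanted, a stable digest is one possible alternative. The following is only a sketch; `seed_from_names` is a hypothetical helper and the names are placeholders, neither is part of the course material.

```python
import hashlib

import numpy as np


def seed_from_names(*names):
    """Map names to a 32-bit seed that is stable across Python sessions."""
    digest = hashlib.sha256("".join(names).encode("utf-8")).digest()
    # Keep the first 4 bytes so the value fits the range [0, 2**32)
    # accepted by np.random.seed.
    return int.from_bytes(digest[:4], "big")


seed = seed_from_names("Alice Example", "Bob Example")  # placeholder names
np.random.seed(seed)
classes = np.random.choice(range(10), 5, replace=False)
print("Seed:", seed, "classes:", classes)
```

Because the digest depends only on the input bytes, rerunning the notebook in a fresh kernel selects the same subset of classes.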


### Plotting

In [52]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
def plot_some_samples(x, y = [], yhat = [], select_from = [], 
                      ncols = 6, nrows = 4, xdim = 16, ydim = 16,
                      label_mapping = range(10)):
    """plot some input vectors as grayscale images (optionally together with their assigned or predicted labels).
    
    x is an NxD - dimensional array, where D is the length of an input vector and N is the number of samples.
    Out of the N samples, ncols x nrows indices are randomly selected from the list select_from (if it is empty, select_from becomes range(N)).
    
    Keyword arguments:
    y             -- corresponding labels to plot in green below each image.
    yhat          -- corresponding predicted labels to plot in red below each image.
    select_from   -- list of indices from which to select the images.
    ncols, nrows  -- number of columns and rows to plot.
    xdim, ydim    -- number of pixels of the images in x- and y-direction.
    label_mapping -- map labels to digits.
    
    """
    fig, ax = plt.subplots(nrows, ncols)
    if len(select_from) == 0:
        select_from = range(x.shape[0])
    indices = np.random.choice(select_from, size = min(ncols * nrows, len(select_from)), replace = False)
    for i, ind in enumerate(indices):
        thisax = ax[i//ncols,i%ncols]
        thisax.matshow(x[ind].reshape(xdim, ydim), cmap='gray')
        thisax.set_axis_off()
        if len(y) != 0:
            j = y[ind] if type(y[ind]) != np.ndarray else y[ind].argmax()
            thisax.text(0, 0, (label_mapping[j]+1)%10, color='green', 
                                                       verticalalignment='top',
                                                       transform=thisax.transAxes)
        if len(yhat) != 0:
            k = yhat[ind] if type(yhat[ind]) != np.ndarray else yhat[ind].argmax()
            thisax.text(1, 0, (label_mapping[k]+1)%10, color='red',
                                             verticalalignment='top',
                                             horizontalalignment='right',
                                             transform=thisax.transAxes)
    return fig

def prepare_standardplot(title, xlabel):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.suptitle(title)
    ax1.set_ylabel('categorical cross entropy')
    ax1.set_xlabel(xlabel)
    ax1.set_yscale('log')
    ax2.set_ylabel('accuracy [% correct]')
    ax2.set_xlabel(xlabel)
    return fig, ax1, ax2

def finalize_standardplot(fig, ax1, ax2):
    ax1handles, ax1labels = ax1.get_legend_handles_labels()
    if len(ax1labels) > 0:
        ax1.legend(ax1handles, ax1labels)
    ax2handles, ax2labels = ax2.get_legend_handles_labels()
    if len(ax2labels) > 0:
        ax2.legend(ax2handles, ax2labels)
    fig.tight_layout()
    plt.subplots_adjust(top=0.9)

def plot_history(history, title):
    fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
    ax1.plot(history.history['loss'], label = "training")
    ax1.plot(history.history['val_loss'], label = "validation")
    ax2.plot(history.history['acc'], label = "training")
    ax2.plot(history.history['val_acc'], label = "validation")
    finalize_standardplot(fig, ax1, ax2)
    return fig


### Loading and preprocessing the data

The data consists of RGB color images with 32x32 pixels, loaded into an array of dimension 32x32x3x(number of images). We convert them to grayscale (using [this method](https://en.wikipedia.org/wiki/SRGB#The_reverse_transformation)) and we downsample them to images of 16x16 pixels by averaging over patches of 2x2 pixels.

With these preprocessing steps we obviously remove some information that could be helpful in classifying the images. But, since the processed data is much lower dimensional, the fitting procedures converge faster. This is an advantage in situations like here (or generally when prototyping), where we want to try many different things without having to wait too long for computations to finish. After having gained some experience, one may want to go back to work on the 32x32 RGB images.
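
To make the 2x2 averaging concrete, here is a small worked example on a single 4x4 toy image. It uses the same strided-slice idea as the `downsample` helper in this notebook; the input values are made up purely for illustration.

```python
import numpy as np

# One 4x4 "image" stored as (height, width, n_images), like the grayscale array.
img = np.arange(16, dtype="float32").reshape(4, 4, 1)

# Add up the four pixels of every 2x2 patch via strided slices, then divide by 4.
small = sum(img[i::2, j::2, :] for i in range(2) for j in range(2)) / 4

print(img[:, :, 0])
print(small[:, :, 0])
```

Each output pixel is the mean of one non-overlapping 2x2 patch; for instance the top-left patch (0, 1, 4, 5) averages to 2.5.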


In [53]:
# convert RGB images x to grayscale using the formula for Y_linear in https://en.wikipedia.org/wiki/Grayscale#Colorimetric_(perceptual_luminance-preserving)_conversion_to_grayscale
def grayscale(x):
    x = x.astype('float32')/255
    x = np.piecewise(x, [x <= 0.04045, x > 0.04045], 
                        [lambda x: x/12.92, lambda x: ((x + .055)/1.055)**2.4])
    # luminance weights for linear R, G, B; they sum to 1
    return .2126 * x[:,:,0,:] + .7152 * x[:,:,1,:]  + .0722 * x[:,:,2,:]

def downsample(x):
    return sum([x[i::2,j::2,:] for i in range(2) for j in range(2)])/4

def preprocess(data):
    gray = grayscale(data['X'])
    downsampled = downsample(gray)
    return (downsampled.reshape(16*16, gray.shape[2]).transpose(),
            data['y'].flatten() - 1)


data_train = scipy.io.loadmat('housenumbers/train_32x32.mat')
data_test = scipy.io.loadmat('housenumbers/test_32x32.mat')

x_train_all, y_train_all = preprocess(data_train)
x_test_all, y_test_all = preprocess(data_test)
print("FullTrain data size: {}\nFull Test data size: {}".format(x_train_all.shape,x_test_all.shape))

FullTrain data size: (73257, 256)
Full Test data size: (26032, 256)


### Selecting a subset of classes

We further reduce the size of the dataset (and thus the computation time) by selecting only the 5 digits (out of 10) listed in subset_of_classes.

In [54]:
def extract_classes(x, y, classes):
    indices = []
    labels = []
    count = 0
    for c in classes:
        tmp = np.where(y == c)[0]
        indices.extend(tmp)
        labels.extend(np.ones(len(tmp), dtype='uint8') * count)
        count += 1
    return x[indices], labels

x_train, y_train = extract_classes(x_train_all, y_train_all, subset_of_classes)
x_test, y_test = extract_classes(x_test_all, y_test_all, subset_of_classes)

print("Sampled Train data size: {}\nSampled Test data size: {}".format(x_train.shape,x_test.shape))

Sampled Train data size: (43226, 256)
Sampled Test data size: (15767, 256)


Let us plot some examples now. The green digit at the bottom left of each image indicates the corresponding label in y_test.
For further usage of the function plot_some_samples, please have a look at its definition in the plotting section.

In [55]:
plot_some_samples(x_test, y_test, label_mapping = subset_of_classes);

To prepare for fitting we transform the labels to one-hot encoding, i.e. for 5 classes, label 2 becomes the vector [0, 0, 1, 0, 0] (Python uses 0-indexing).
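
The notebook relies on `keras.utils.to_categorical` for this step. As an illustration of what the transformation does, the following NumPy-only sketch builds the same kind of vectors by indexing rows of an identity matrix; the label values are arbitrary examples.

```python
import numpy as np

labels = np.array([2, 0, 4])  # example integer labels
n_classes = 5

# Row k of the identity matrix is the one-hot vector for label k.
one_hot = np.eye(n_classes, dtype="float32")[labels]

# argmax along the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```

The `argmax` round trip is the same trick the plotting helper uses when it receives one-hot rows instead of integer labels.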

In [56]:
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

## Exercise 1: No hidden layer

### Description

Define and fit a model without a hidden layer. 

1. Use the softmax activation for the output layer.
2. Use the categorical_crossentropy loss.
3. Add the accuracy metric to the metrics.
4. Choose stochastic gradient descent for the optimizer.
5. Choose a minibatch size of 128.
6. Fit for as many epochs as needed to see no further decrease in the validation loss.
7. Plot the output of the fitting procedure (a history object) using the function plot_history defined above.
8. Determine the indices of all test images that are misclassified by the fitted model and plot some of them using the function 
   `plot_some_samples(x_test, y_test, yhat_test, error_indices, label_mapping = subset_of_classes)`
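
For orientation, the softmax activation and the categorical cross-entropy loss named in items 1 and 2 can be written out in a few lines of NumPy. This is a minimal sketch for intuition, not the Keras implementation; the function names and the clipping constant `eps` are choices made for this example.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_prob[k])."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))


logits = np.zeros((2, 5))        # an untrained model with all-zero outputs
y_true = np.eye(5)[[1, 3]]       # one-hot labels for two samples
probs = softmax(logits)          # uniform probabilities over 5 classes
loss = categorical_crossentropy(y_true, probs)
print(probs, loss)
```

With uniform predictions over 5 classes the loss equals log(5), which is a useful reference value when reading the first epochs of a learning curve.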


Hints:
* Read the keras docs, in particular [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/).
* Have a look at the keras [examples](https://github.com/keras-team/keras/tree/master/examples), e.g. [mnist_mlp](https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py).

### Solution

In [57]:
####################################################################################################
# Model of the exercise and plot
####################################################################################################
model = Sequential()
model.add(Dense(5, activation='softmax', input_shape=x_train.shape[1:]))

model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=450,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
plot_history(history, 'Learning Curve and Accuracy')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_180 (Dense)            (None, 5)                 1285      
Total params: 1,285
Trainable params: 1,285
Non-trainable params: 0
_________________________________________________________________
Train on 43226 samples, validate on 15767 samples
Epoch 1/450
...
Epoch 184/450

KeyboardInterrupt: 

In [None]:
####################################################################################################
# Question 8 : Determine the indices of all test images that are misclassified by the fitted model
####################################################################################################

##################################################
# We found two ways of doing that
##################################################

####### First way #######
prediction=model.predict(x_test, batch_size=None, verbose=0)
idx=[]
error=0
for i in range (y_test[:,0].size):
    for j in range (y_test[0,:].size):
        if ((y_test[i][j]==1) & (max(prediction[i])!=prediction[i][j])):
            idx.append(i)
            error=error+1
            break
print ('First way : Number of errors: ',error)
### plot some of them using the function ###
plot_some_samples(x_test, y_test, prediction, idx[0:30]  ,label_mapping = subset_of_classes)
plt.show()

######## Second way ########
idx_2=[]
error_2=0
for i in range (len(prediction)):
    max_y_test=np.argmax(y_test[i])
    max_pred=np.argmax(prediction[i])
    if max_y_test!=max_pred :
        idx_2.append(i)
        error_2=error_2+1
print ('Second way : Number of errors: ',error_2)
### plot some of them using the function ###
plot_some_samples(x_test, y_test, prediction, idx_2[0:30]  ,label_mapping = subset_of_classes)
plt.show()
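
Both loops above compare the position of the maximum in each label row with the position of the maximum in each prediction row. The same comparison can be done without explicit Python loops. The sketch below assumes one-hot labels and a matrix of predicted probabilities, as in this notebook; `misclassified_indices` is a name introduced here and the toy arrays are invented for illustration.

```python
import numpy as np


def misclassified_indices(y_true_onehot, y_prob):
    """Return the indices of samples whose predicted class differs from the label."""
    true_labels = np.argmax(y_true_onehot, axis=1)
    pred_labels = np.argmax(y_prob, axis=1)
    return np.flatnonzero(true_labels != pred_labels)


# Toy data: 4 samples, 3 classes.
y_true = np.eye(3)[[0, 1, 2, 1]]
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.2, 0.6],
                   [0.5, 0.4, 0.1]])

errors = misclassified_indices(y_true, y_prob)
print("Number of errors:", len(errors), "at indices", errors)
```

In the notebook, the returned array could be passed as the `select_from` argument of `plot_some_samples` in place of `idx` or `idx_2`.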

## Exercise 2: One hidden layer, different optimizers

### Description

Train a network with one hidden layer and compare different optimizers.

1. Use one hidden layer with 64 units and the 'relu' activation. Use the [summary method](https://keras.io/models/about-keras-models/) to inspect your model.
2. Fit the model for 50 epochs with different learning rates of stochastic gradient descent and answer the question below.
3. Replace the stochastic gradient descent optimizer with the [Adam optimizer](https://keras.io/optimizers/#adam).
4. Plot the learning curves of SGD with a reasonable learning rate together with the learning curves of Adam in the same figure. Make sure the curves are labeled clearly in the plot.
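
To see why the two optimizers can behave so differently, it helps to write out a single parameter update for each. The sketch below implements the plain SGD step and the Adam step with bias correction in NumPy; the default hyperparameters follow the values suggested in the Adam paper, and the function names and the example gradient are assumptions made for this illustration.

```python
import numpy as np


def sgd_step(theta, grad, lr=0.01):
    """Plain gradient step: the step size scales with the gradient."""
    return theta - lr * grad


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


theta0 = np.array([1.0])
big_grad = np.array([100.0])  # deliberately large gradient

theta_sgd = sgd_step(theta0, big_grad)
theta_adam, m, v = adam_step(theta0, big_grad, m=np.zeros(1), v=np.zeros(1), t=1)
print(theta_sgd, theta_adam)
```

With the same large gradient, the SGD step moves the parameter by `lr * grad`, while the first Adam step moves it by roughly `lr` regardless of the gradient's magnitude. This normalization is one reason Adam is less sensitive to the choice of learning rate.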

### Solution

In [None]:
####################################################################################################
# Question 2 : Fit model with different Learning Rate
####################################################################################################

### Learning Rate ###
l_r=[0.001,0.01,0.05,0.1,0.5,1]

####################################################################################################
# Simulation for the different learning rates
####################################################################################################
for i in range (len(l_r)):
    model_1 = Sequential()
    model_1.add(Dense(64, activation='relu', input_shape=x_train.shape[1:]))
    model_1.add(Dense(5, activation='softmax'))

    # Inspect the model #
    model_1.summary()

    model_1.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(lr=l_r[i]),
              metrics=['accuracy'])

    history_1 = model_1.fit(x_train, y_train,
                        batch_size=128,
                        epochs=50,
                        verbose=0,
                        validation_data=(x_test, y_test))

    score = model_1.evaluate(x_test, y_test, verbose=0)

    # plot #
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
    plot_history(history_1, 'SGD (l_r={0}) : Learning Curve and Accuracy'.format(l_r[i]))
    plt.show()

**Question**: What happens if the learning rate of SGD is A) very large B) very small? Please answer A) and B) with one full sentence (double click this markdown cell to edit).

**Answer**:

A) Very large: The validation curve becomes noisy, because with such large update steps the validation set is not big enough to average the fluctuations out. We also see a gap between the training and validation curves, which is expected. A large learning rate makes the loss oscillate around the minimum instead of settling into it, so the fit does not converge and may even diverge.

B) Very small: There is no visible difference between the training and validation curves. With only 50 epochs and a small learning rate, the model does not have enough time to learn: the changes between epochs are tiny, so convergence is very slow. Reaching the best values would take a very large number of epochs, which is not practical.
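
Both regimes can be reproduced on a toy problem. The sketch below runs plain (full-batch) gradient descent on the one-dimensional function f(w) = w**2, which is a deliberately simplified stand-in for the network's loss and not the SGD setup used in this notebook; the three learning rates are chosen only to illustrate the effect.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


w_small = gradient_descent(lr=0.001)  # barely moves away from w0 in 50 steps
w_good = gradient_descent(lr=0.1)     # ends up very close to the minimum at 0
w_large = gradient_descent(lr=1.1)    # overshoots more on every step and diverges

print(w_small, w_good, w_large)
```

Each update multiplies `w` by `1 - 2 * lr`, so the iterate shrinks slowly when this factor is just below 1 and grows in magnitude, with alternating sign, once the factor is below -1.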


In [None]:
####################################################################################################
# SGD for the graph with lr=0.05 : Question 4
####################################################################################################
model_sgd = Sequential()
model_sgd.add(Dense(64, activation='relu', input_shape=x_train.shape[1:]))
model_sgd.add(Dense(5, activation='softmax'))

model_sgd.summary()

model_sgd.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.05),
          metrics=['accuracy'])

history_sgd = model_sgd.fit(x_train, y_train,
                    batch_size=128,
                    epochs=50,
                    verbose=0,
                    validation_data=(x_test, y_test))

score = model_sgd.evaluate(x_test, y_test, verbose=0)

print('Test loss:', score[0])
print('Test accuracy:', score[1])
plot_history(history_sgd, 'SGD l_r= 0.05 : Learning Curve and Accuracy')

In [None]:
####################################################################################################
# ADAM OPTIMIZER Question 3 
####################################################################################################
model_adam = Sequential()
model_adam.add(Dense(64, activation='relu', input_shape=x_train.shape[1:]))
model_adam.add(Dense(5, activation='softmax'))

model_adam.summary()

model_adam.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

history_adam = model_adam.fit(x_train, y_train,
                    batch_size=128,
                    epochs=50,
                    verbose=0,
                    validation_data=(x_test, y_test))

score = model_adam.evaluate(x_test, y_test, verbose=0)

print('Test loss:', score[0])
print('Test accuracy:', score[1])
plot_history(history_adam, 'Adam : Learning Curve and Accuracy')

In [None]:
####################################################################################################
# Question 4: Graph of SGD and ADAM 
####################################################################################################

##################################################
# Plot with the 2 models
##################################################
def plot_history_2(history_adam,history_sgd, title):
    fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
    ax1.plot(history_adam.history['loss'], label = "Training_loss_adam")
    ax1.plot(history_adam.history['val_loss'], label = "Validation_loss_adam")
    ax2.plot(history_adam.history['acc'], label = "Training_accuracy_adam")
    ax2.plot(history_adam.history['val_acc'], label = "Validation_accuracy_adam")
    ax1.plot(history_sgd.history['loss'], label = "Training_loss_sgd_lr=0.05")
    ax1.plot(history_sgd.history['val_loss'], label = "Validation_loss_sgd_lr=0.05")
    ax2.plot(history_sgd.history['acc'], label = "Training_accuracy_sgd_lr=0.05")
    ax2.plot(history_sgd.history['val_acc'], label = "Validation_accuracy_sgd_lr=0.05")
    finalize_standardplot(fig, ax1, ax2)
    return fig

plot_history_2(history_adam,history_sgd,'Learning Curve and Accuracy for SGD and Adam optimizer')

## Exercise 3: Overfitting and early stopping with Adam

### Description

Run the above simulation with Adam for sufficiently many epochs (be patient!) until you see clear overfitting.

1. Plot the learning curves of a fit with Adam and sufficiently many epochs and answer the questions below.

A simple but effective means to avoid overfitting is early stopping, i.e. a fit is not run until convergence but stopped as soon as the validation error starts to increase. We will use early stopping in all subsequent exercises.
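
In practice the validation loss fluctuates, so the stopping rule usually tolerates a few epochs without improvement before giving up. The following is a simplified sketch of such a patience-based rule in plain Python. It is not the Keras callback itself; the function name is hypothetical, the parameters `patience` and `min_delta` only mirror the names of common early-stopping options, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.001):
    """Return the 0-based epoch at which training would stop.

    An epoch counts as an improvement only if the loss drops by more than
    min_delta below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: run to the end


# Made-up validation losses: decrease, then increase (overfitting).
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.72, 0.75, 0.80]
stop = early_stopping_epoch(val_losses, patience=3)
print("Stop at epoch", stop)
```

Here the best loss occurs at epoch 3, and training stops three non-improving epochs later. A larger `patience` protects against stopping on a noisy blip, at the cost of running further into the overfitting regime.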

### Solution

In [None]:
####################################################################################################
# ADAM OPTIMIZER
####################################################################################################
model_adam = Sequential()
model_adam.add(Dense(64, activation='relu', input_shape=x_train.shape[1:]))
model_adam.add(Dense(5, activation='softmax'))

model_adam.summary()

model_adam.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

history_adam = model_adam.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))

score = model_adam.evaluate(x_test, y_test, verbose=0)

print('Test loss:', score[0])
print('Test accuracy:', score[1])
plot_history(history_adam, 'Adam optimizer : Loss and Accuracy')

In [None]:
####################################################################################################
# EARLY STOPPING : ADAM OPTIMIZER
####################################################################################################
model_early = Sequential()
model_early.add(Dense(64, activation='relu', input_shape=x_train.shape[1:]))
model_early.add(Dense(5, activation='softmax'))

model_early.summary()

call= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5, verbose=0, mode='auto')

model_early.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

history_early = model_early.fit(x_train, y_train,  callbacks=[call],
                    batch_size=128,
                    epochs=300,
                    verbose=0,
                    validation_data=(x_test, y_test))

score = model_early.evaluate(x_test, y_test, verbose=0)

print('Test loss:', score[0])
print('Test accuracy:', score[1])
plot_history(history_early, 'Adam Optimizer with Early Stopping : Loss and Accuracy ')

**Question 1**: At which epoch (approximately) does the model start to overfit? Please answer with one full sentence.

**Answer**: The model starts to overfit at around epoch 50.

**Question 2**: Explain the qualitative difference between the loss curves and the accuracy curves with respect to signs of overfitting. Please answer with at most 3 full sentences.

**Answer**:  In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
In the accuracy curves, overfitting shows up as a training accuracy that keeps increasing while the validation accuracy stays the same. In the loss curves, overfitting begins when the validation categorical cross entropy starts to increase after a rapid decrease, while the training categorical cross entropy keeps decreasing. Both signs mean that the model becomes more and more precise on the training set, with no effect or even a bad effect on the validation set. In other words, the model corresponds too closely or exactly to one particular set of data, the training data. This is overfitting.


## Exercise 4: Model performance as a function of number of hidden neurons

### Description

Investigate how the best validation loss and accuracy depend on the number of hidden neurons in a single layer.

1. Fit a reasonable number of models with different hidden layer size (between 10 and 1000 hidden neurons) for a fixed number of epochs well beyond the point of overfitting.
2. Collect some statistics by fitting the same models as in 1. for multiple initial conditions. Hints: 1. If you don't reset the random seed, you get different initial conditions each time you create a new model. 2. Let your computer work while you are asleep.
3. Plot summary statistics of the final validation loss and accuracy versus the number of hidden neurons. Hint: [boxplots](https://matplotlib.org/examples/pylab_examples/boxplot_demo.html) (also [here](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.boxplot.html?highlight=boxplot#matplotlib.axes.Axes.boxplot)) are useful. You may also want to use the matplotlib method set_xticklabels.
4. Plot summary statistics of the loss and accuracy for early stopping versus the number of hidden neurons.
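
One way to organize steps 1 to 3 is to separate the sweep logic from the actual fit. The sketch below assumes a user-supplied callable `fit_fn(n_hidden)` that builds and fits one model and returns a single score (for example the final validation loss). All names here are hypothetical, and `dummy_fit` is only a placeholder that returns a deterministic number instead of training a network.

```python
import numpy as np


def sweep_hidden_sizes(sizes, n_repeats, fit_fn):
    """Call fit_fn(n_hidden) n_repeats times per size and collect the scores."""
    return {n: [fit_fn(n) for _ in range(n_repeats)] for n in sizes}


def summarize(results):
    """Median and interquartile range of the scores for each hidden-layer size."""
    summary = {}
    for n, scores in results.items():
        q1, med, q3 = np.percentile(scores, [25, 50, 75])
        summary[n] = {"median": float(med), "iqr": float(q3 - q1)}
    return summary


def dummy_fit(n_hidden):
    """Placeholder for a real fit; returns a dummy score, not a trained result."""
    return 1.0 / n_hidden


results = sweep_hidden_sizes([10, 100], n_repeats=3, fit_fn=dummy_fit)
summary = summarize(results)
print(summary)
```

The lists in `results` have the shape that a boxplot expects, one list of scores per hidden-layer size, so `list(results.values())` could be passed to matplotlib's `boxplot` with the sizes as tick labels.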

### Solution

In [None]:
####################################################################################################
# Question 1 : Fit a reasonable number of models with different hidden layer size 
####################################################################################################

#10#
model_10 = Sequential()
model_10.add(Dense(10, activation='relu', input_shape=x_train.shape[1:]))
model_10.add(Dense(5, activation='softmax'))
model_10.summary()
model_10.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.1),
          metrics=['accuracy'])
history_10 = model_10.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))

score = model_10.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

#50#
model_50 = Sequential()
model_50.add(Dense(50, activation='relu', input_shape=x_train.shape[1:]))
model_50.add(Dense(5, activation='softmax'))
model_50.summary()
model_50.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.1),
          metrics=['accuracy'])
history_50 = model_50.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))
score = model_50.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

#100#
model_100 = Sequential()
model_100.add(Dense(100, activation='relu', input_shape=x_train.shape[1:]))
model_100.add(Dense(5, activation='softmax'))
model_100.summary()
model_100.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.1),
          metrics=['accuracy'])
history_100 = model_100.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))
score = model_100.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

#300#
model_300 = Sequential()
model_300.add(Dense(300, activation='relu', input_shape=x_train.shape[1:]))
model_300.add(Dense(5, activation='softmax'))
model_300.summary()
model_300.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.1),
          metrics=['accuracy'])
history_300 = model_300.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))
score = model_300.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

#600#
model_600 = Sequential()
model_600.add(Dense(600, activation='relu', input_shape=x_train.shape[1:]))
model_600.add(Dense(5, activation='softmax'))
model_600.summary()
model_600.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.5),
          metrics=['accuracy'])
history_600 = model_600.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))
score = model_600.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

#1000#
model_1000 = Sequential()
model_1000.add(Dense(1000, activation='relu', input_shape=x_train.shape[1:]))
model_1000.add(Dense(5, activation='softmax'))
model_1000.summary()
model_1000.compile(loss='categorical_crossentropy',
          optimizer=keras.optimizers.SGD(lr=0.1),
          metrics=['accuracy'])
history_1000 = model_1000.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    verbose=0,
                    validation_data=(x_test, y_test))
score = model_1000.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

##################################################
# Plot all curves for the different numbers of neurons in the hidden layer
##################################################
def plot_history_3(history_10,history_50,history_100,history_300,history_600,history_1000, title):
    fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
    # (history, number of hidden neurons, line color) for each fitted model
    runs = [(history_10, 10, "blue"), (history_50, 50, "red"), (history_100, 100, "green"),
            (history_300, 300, "orange"), (history_600, 600, "yellow"), (history_1000, 1000, "black")]
    for history, n, color in runs:
        suffix = "with {} hidden neurons".format(n)
        # Dotted lines show training curves, dashed lines show validation curves
        ax1.plot(history.history['loss'], ':', color=color, label="Training Loss " + suffix)
        ax1.plot(history.history['val_loss'], '--', color=color, label="Validation Loss " + suffix)
        ax2.plot(history.history['acc'], ':', color=color, label="Training Accuracy " + suffix)
        ax2.plot(history.history['val_acc'], '--', color=color, label="Validation Accuracy " + suffix)
    finalize_standardplot(fig, ax1, ax2)
    return fig

In [None]:
##################################################
# Plot all the models in the same figure
##################################################
plot_history_3(history_10,history_50,history_100,history_300,history_600,history_1000,'Learning Curve and Accuracy')
plt.show()

In [None]:
####################################################################################################
# Question 1 : Learning curves of the networks for each number of hidden neurons
####################################################################################################

##################################################
# Hidden layer sizes
##################################################
hidden=[10,50,100,300,600,1000]

####################################################################################################
# Fit one network per number of hidden neurons
####################################################################################################
for i in range (len(hidden)):
    model= Sequential()
    model.add(Dense(hidden[i], activation='relu', input_shape=x_train.shape[1:]))
    model.add(Dense(5, activation='softmax'))
    model.summary()
    model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(lr=0.05),
              metrics=['accuracy'])
    history= model.fit(x_train, y_train,
                        batch_size=128,
                        epochs=100,
                        verbose=0,
                        validation_data=(x_test, y_test))
    plot_history(history, 'Hidden Layer of {0} neurons : Learning Curve and Accuracy'.format(hidden[i]))
    plt.show()

In [None]:
####################################################################################################
# Question 3 : Plot summary statistics of the final validation loss and accuracy versus the number of hidden neurons. 
####################################################################################################
import math
import time

start = time.time()
print("Start of the script execution...")

##################################################
# Hidden layer sizes
##################################################
hidden=[10,50,100,300,600,1000]

##################################################
# Number of runs with different initial conditions
##################################################
times=5

##################################################
# Data for the plots
##################################################
data_val_loss=np.empty((times,len(hidden)))
data_val_accuracy=np.empty((times,len(hidden)))
data_train_loss=np.empty((times,len(hidden)))
data_train_acc=np.empty((times,len(hidden)))

####################################################################################################
# Models with different numbers of hidden neurons and different initial conditions
####################################################################################################

# To get different initial conditions, we rebuild the model on every run so that its weights are re-initialized randomly
for i in range (len(hidden)):
        for k in range (times):
            print('hidden =',hidden[i],' and k =',k)
            model= Sequential()
            model.add(Dense(hidden[i], activation='relu', input_shape=x_train.shape[1:], name='layer1'))
            model.add(Dense(5, activation='softmax', name='layer2'))
            model.summary()
            model.compile(loss='categorical_crossentropy',
                      optimizer=keras.optimizers.SGD(lr=0.05),
                      metrics=['accuracy'])
            history= model.fit(x_train, y_train,
                                batch_size=128,
                                epochs=250,
                                verbose=0,
                                validation_data=(x_test, y_test))
            score = model.evaluate(x_test, y_test, verbose=0)
            print('Test loss:', score[0])
            print('Test accuracy:', score[1])
            data_val_loss[k][i]= (history.history['val_loss'][-1])
            data_val_accuracy[k][i]=(history.history['val_acc'][-1])
            data_train_loss[k][i]=(history.history['loss'][-1])
            data_train_acc[k][i]=(history.history['acc'][-1])

print("End of the execution !")
end = time.time()
print("Execution time : {} seconds.".format(math.ceil(end - start)))

##################################################
# Plot 
##################################################

plt.figure()
plt.boxplot(data_val_loss)
plt.title('Boxplot : Validation Loss Plot')
plt.xticks([1,2,3,4,5,6], ['10','50','100','300','600','1000'])
plt.xlabel('Number of hidden neurons')
plt.ylabel('Final Validation Loss')

plt.figure()
plt.boxplot(data_val_accuracy)
plt.title('Boxplot : Validation Accuracy Plot')
plt.xticks([1,2,3,4,5,6], ['10','50','100','300','600','1000'])
plt.xlabel('Number of hidden neurons')
plt.ylabel('Final Validation Accuracy')

plt.show()

In [None]:
####################################################################################################
# Question 4 : Plot summary statistics of the loss and accuracy for early stopping versus the number of hidden neurons.
####################################################################################################

import math
import time

start = time.time()
print("Start of the script execution...")

### Hidden layer sizes ###
hidden=[10,50,100,300,600,1000]

### Number of runs with different initial conditions ###
times=5

### Data for the plots ###
data_val_loss=np.empty((times,len(hidden)))
data_val_accuracy=np.empty((times,len(hidden)))
data_train_loss=np.empty((times,len(hidden)))
data_train_acc=np.empty((times,len(hidden)))


### Models with different numbers of hidden neurons and different initial conditions ###
for i in range (len(hidden)):
    for k in range(times):
        print('hidden =',hidden[i],' and k =',k)
        model= Sequential()
        model.add(Dense(hidden[i], activation='relu', input_shape=x_train.shape[1:]))
        model.add(Dense(5, activation='softmax'))
        model.summary()
        call= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5, verbose=0, mode='auto')
        model.compile(loss='categorical_crossentropy',
                  optimizer=keras.optimizers.SGD(lr=0.05),
                  metrics=['accuracy'])
        history= model.fit(x_train, y_train, callbacks=[call],
                            batch_size=128,
                            epochs=250,
                            verbose=0,
                            validation_data=(x_test, y_test))
        data_val_loss[k][i]= (history.history['val_loss'][-1])
        data_val_accuracy[k][i]=(history.history['val_acc'][-1])
        data_train_loss[k][i]=(history.history['loss'][-1])
        data_train_acc[k][i]=(history.history['acc'][-1])
        
print("End of the execution !")
end = time.time()
print("Execution time : {} seconds.".format(math.ceil(end - start)))

### Plot ###

plt.figure()
plt.boxplot(data_val_loss)
plt.title('Boxplot : Validation Loss Plot With Early Stopping')
plt.xticks([1,2,3,4,5,6], ['10','50','100','300','600','1000'])
plt.xlabel('Number of hidden neurons')
plt.ylabel('Final Validation Loss')

plt.figure()
plt.boxplot(data_val_accuracy)
plt.title('Boxplot : Validation Accuracy Plot With Early Stopping')
plt.xticks([1,2,3,4,5,6], ['10','50','100','300','600','1000'])
plt.xlabel('Number of hidden neurons')
plt.ylabel('Final Validation Accuracy')

plt.figure()
plt.boxplot(data_train_loss)
plt.title('Boxplot: Train Loss Plot With Early Stopping')
plt.xticks([1,2,3,4,5,6], ['10','50','100','300','600','1000'])
plt.xlabel('Number of hidden neurons')
plt.ylabel('Final Train Loss')

plt.figure()
plt.boxplot(data_train_acc)
plt.title('Boxplot: Train Accuracy Plot With Early Stopping')
plt.xticks([1,2,3,4,5,6], ['10','50','100','300','600','1000'])
plt.xlabel('Number of hidden neurons')
plt.ylabel('Final Train Accuracy')

plt.show()

## Exercise 5: Comparison to deep models

### Description

Instead of choosing one hidden layer (with many neurons) you experiment here with multiple hidden layers (each with not so many neurons).

1. Fit models with 2, 3 and 4 hidden layers with approximately the same number of parameters as a network with one hidden layer of 100 neurons. Hint: Calculate the number of parameters in a network with input dimensionality N_in, K hidden layers with N_h units, one output layer with N_out dimensions and solve for N_h. Confirm your result with the keras method model.summary().
2. Run each model multiple times with different initial conditions and plot summary statistics of the best validation loss and accuracy versus the number of hidden layers.
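
The hint in step 1 can be worked out explicitly. The following is a minimal sketch, assuming flattened 16x16 inputs (N_in = 256) and N_out = 5 classes as used elsewhere in this notebook. The helper names `count_params` and `solve_hidden_width` are introduced here for illustration only and are not part of keras.

```python
import math

def count_params(n_in, n_h, n_out, k):
    """Weights and biases of an MLP with k hidden layers of n_h units each."""
    first = (n_in + 1) * n_h              # input -> first hidden layer
    middle = (k - 1) * (n_h + 1) * n_h    # hidden -> hidden connections
    last = (n_h + 1) * n_out              # last hidden -> output layer
    return first + middle + last

def solve_hidden_width(n_in, n_out, k, target):
    """Real-valued n_h for which count_params(...) equals target.

    Expanding count_params gives a*n_h^2 + b*n_h + c = 0 with the
    coefficients below; the positive root is returned.
    """
    a = k - 1
    b = n_in + k + n_out
    c = n_out - target
    if a == 0:                            # one hidden layer: linear equation
        return -c / b
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

N_IN, N_OUT = 256, 5                      # assumed: 16x16 inputs, 5 classes
target = count_params(N_IN, 100, N_OUT, 1)
widths = [round(solve_hidden_width(N_IN, N_OUT, k, target)) for k in (1, 2, 3, 4)]
print(target, widths)                     # 26205 [100, 77, 66, 59]
```

Under these assumptions the rounded widths agree with the values 100, 77, 66 and 59 used in the solution; `model.summary()` remains the authoritative check.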

### Solution

In [None]:
################################################################
# 1. Models with different numbers of hidden layers and approximately the same
#    parameter count as the model with one hidden layer of 100 neurons
#################################################################

data_loss=[]
data_accuracy=[]
times=10

n_hidden = [100, 77, 66, 59]      # Number of hidden neurons per layer
n_repeat = [1, 2, 3, 4]           # Number of hidden layers

########################################
# Create parametric model with given hidden and repeat arguments
#########################################

def dynamic_model(neurons, repeat):
    y = Input(shape=x_train.shape[1:])
    input_img = y
    for j in range(repeat):
        y = Dense(neurons, activation='relu') (y)
    y = Dense(5, activation='softmax')(y)
    model = Model(input_img, y)
    return model

models=[]

for j in range(4):
    print("\n\nModel Summary with {} hidden layers".format(n_repeat[j]))
    models.append(dynamic_model(n_hidden[j], n_repeat[j]))    
    models[j].summary()

In [None]:
##################################################
# Part 2 - Run each model `times` times with different initial conditions for a fixed number of epochs (`num_epochs`)
###################################################


n_hidden = [100, 77, 66, 59]
n_repeat = [1, 2, 3, 4]
total_exps = len(n_hidden)
times = 5
num_epochs = 500

data_val_loss     = np.zeros((times,4))
data_val_accuracy = np.zeros((times,4))
data_train_loss   = np.zeros((times,4))
data_train_acc    = np.zeros((times,4))

summary_naive = {}


######################################################
# We do not use early stopping here. Instead, we train for enough epochs and keep the weights with the best validation accuracy
# using the ModelCheckpoint callback.
# The printed output shows the validation loss and accuracy of the best checkpoint for each run.
########################################################

for i in range(total_exps):
    for k in range(times):
        print("\n\n*****************************************************************************")
        print('Running experiment {}/{} : with {} hidden layers ({} neurons per hidden layer)'.format(k+1,times, i+1, n_hidden[i]))
        model = dynamic_model(n_hidden[i], n_repeat[i])
        model.summary()
        #call= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.001, patience=15, verbose=0, mode='auto')
        saveBestModel = keras.callbacks.ModelCheckpoint("./best_weights.hdf5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto',save_weights_only=True)
        model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.SGD(lr=0.05), metrics=['accuracy'])
        history= model.fit(x_train, y_train, callbacks=[saveBestModel],batch_size=128, epochs=num_epochs, verbose=0, 
                           validation_data=(x_test, y_test))
        model.load_weights("./best_weights.hdf5")
        score = model.evaluate(x_train, y_train, verbose=0)
        print("Train Loss: {} , Train Accuracy: {}".format(score[0], score[1]))
        score = model.evaluate(x_test, y_test, verbose=0)
        data_val_loss[k][i]= score[0]
        data_val_accuracy[k][i]=score[1]
        print("Valid Loss: {} , Valid Accuracy: {}".format(data_val_loss[k][i],data_val_accuracy[k][i]))
    summary_naive['hidden_{}'.format(n_repeat[i])] = history
        

plt.figure()
plt.boxplot(data_val_loss)
plt.title('Summary Statistics of Best Validation Loss')
plt.xticks([1,2,3,4], ['1','2','3','4'])
plt.xlabel('Number of hidden layers')
plt.ylabel('Best Validation Loss')

plt.figure()
plt.boxplot(data_val_accuracy)
plt.title('Summary Statistics of Best Validation Accuracy')
plt.xticks([1,2,3,4], ['1','2','3','4'])
plt.xlabel('Number of hidden layers')
plt.ylabel('Best Validation Accuracy')


plt.show()

## Exercise 6: Tricks (regularization, batch normalization, dropout)

### Description

Overfitting can also be counteracted with regularization and dropout. Batch normalization is supposed to mainly decrease convergence time.

1. Try to improve the best validation scores of the model with 1 layer and 100 hidden neurons and the model with 4 hidden layers. Experiment with batch_normalization layers, dropout layers and l1- and l2-regularization on weights (kernels) and biases.
2. After you have found good settings, plot for both models the learning curves of the naive model you fitted in the previous exercises together with the learning curves of the current version.
3. For proper comparison, plot also the learning curves of the two current models in a third figure.
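
To make the three tricks concrete, here is a minimal NumPy sketch of what each one computes during training. It is an illustration under simplifying assumptions (batch normalization without the learned scale and shift, a toy minibatch of random activations) and not the keras implementation; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(loc=3.0, scale=2.0, size=(128, 4))  # toy minibatch

def batch_norm_forward(x, eps=1e-3):
    """Normalize each feature over the minibatch (training mode, no scale/shift)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def inverted_dropout(x, rate, rng):
    """Zero a random fraction `rate` of the units and rescale the rest by 1/(1-rate)."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def l2_penalty(weights, decay):
    """Term added to the loss: decay times the sum of squared weights."""
    return decay * np.sum(weights ** 2)

normalized = batch_norm_forward(activations)
dropped = inverted_dropout(normalized, rate=0.1, rng=rng)
penalty = l2_penalty(np.ones((3, 2)), decay=1e-4)

print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(3))
print(penalty)
```

After normalization every feature has mean close to 0 and standard deviation close to 1, which is the property batch normalization enforces per minibatch; dropout and the L2 term both act only during training.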

### Solution

In [None]:
##########################################################################
# Add regularisation to the model in the form of weight decay and dropout,
# plus batch normalization to reduce internal covariate shift
##########################################################################

n_hidden =[100, 59]
n_repeat =[1, 4]
l2_decay= 1e-4
l1_decay= 1e-4

#########################################################
# Custom architecture with additional dropout, batchnorm
#########################################################
def dynamic_model_regularised(neurons, repeat, adddropout=True, addbatchnorm=True):
    y = Input(shape=x_train.shape[1:])
    input_img = y
    for j in range(repeat):
        if adddropout:
            y = Dropout(0.1)(y)
        y = Dense(neurons, activation=None, kernel_regularizer=regularizers.l2(l2_decay), bias_regularizer=regularizers.l1(l1_decay))(y)
        if addbatchnorm:
            y = BatchNormalization()(y)
        y = Activation('relu')(y)
       
    y = Dense(5, activation='softmax')(y)
    model = Model(input_img, y)
    return model

models=[]
for j in range(2):
    print("\n\nModel Summary with {} hidden layers".format(n_repeat[j]))
    models.append(dynamic_model_regularised(n_hidden[j], n_repeat[j]))    
    models[j].summary()

In [None]:
##############################################################################
# Fit each regularised model `times` times and report the scores of the best checkpoint (by validation accuracy)
##############################################################################
import math

lr_adam= 1e-3
n_hidden =[100, 59]
n_repeat =[1, 4]
total_exps = len(n_hidden)
times= 1
n_exps= len(n_hidden)
num_epochs = 500
print(" Running models with {} hidden layers for {} times\n\n".format(', '.join([str(i) for i in n_repeat]), times))

data_val_loss     = np.zeros((times,n_exps))
data_val_accuracy = np.zeros((times,n_exps))
data_train_loss   = np.zeros((times,n_exps))
data_train_acc    = np.zeros((times,n_exps))
summary_reg={}


####################################
# Experiments are run for enough epochs and the best model is used to report the scores
####################################

for i in range(total_exps):
    for k in range(times):
        print("*****************************************************************************")
        print('\n\nRunning experiment {}/{} : with {} hidden layers ({} neurons per hidden layer)'.format(k+1,times, n_repeat[i], n_hidden[i]))
        model=dynamic_model_regularised(n_hidden[i], n_repeat[i])
        model.summary()
        #call= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.001, patience=15, verbose=0, mode='auto')
        saveBestModel = keras.callbacks.ModelCheckpoint("./best_weights.hdf5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto',save_weights_only=True)

        model.compile(loss='categorical_crossentropy',
                  optimizer=keras.optimizers.SGD(lr=0.005),metrics=['accuracy'])
        history= model.fit(x_train, y_train, callbacks=[saveBestModel],
                            batch_size=128, epochs=num_epochs, verbose=1, validation_data=(x_test, y_test))
        model.load_weights("./best_weights.hdf5")
        score = model.evaluate(x_train, y_train, verbose=0)
        print("Train Loss: {} , Train Accuracy: {}".format(score[0], score[1]))
        score = model.evaluate(x_test, y_test, verbose=0)
        print("Test loss: {} , Test accuracy: {}".format(score[0],score[1]))
      
    summary_reg['hidden_{}'.format(n_repeat[i])] = history



In [None]:
##################################################
#Part 2 : plots of Regularised model vs Naive model
####################################################
title='Learning Curve and Accuracy of Regularised model vs Naive model (1 Hidden layer)'
history_reg=summary_reg['hidden_1']
history_naive=summary_naive['hidden_1']
fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
ax1.plot(history_reg.history['loss'][:250], label = "train reg")
ax1.plot(history_reg.history['val_loss'][:250], label = "val reg")
ax1.plot(history_naive.history['loss'], label = "train naive")
ax1.plot(history_naive.history['val_loss'], label = "val naive")


ax2.plot(history_reg.history['acc'][:250], label = "train reg.")
ax2.plot(history_reg.history['val_acc'][:250], label = "val reg.")
ax2.plot(history_naive.history['acc'], label = "train naive")
ax2.plot(history_naive.history['val_acc'], label = "val naive")
finalize_standardplot(fig, ax1, ax2)
plt.show()

title='Loss Curve and Accuracy of Regularised model vs Naive model (4 hidden layers)'
history_reg=summary_reg['hidden_4']
history_naive=summary_naive['hidden_4']
fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
ax1.plot(history_reg.history['loss'][:250], label = "train reg")
ax1.plot(history_reg.history['val_loss'][:250], label = "val reg")
ax1.plot(history_naive.history['loss'], label = "train naive")
ax1.plot(history_naive.history['val_loss'], label = "val naive")


ax2.plot(history_reg.history['acc'][:250], label = "train reg.")
ax2.plot(history_reg.history['val_acc'][:250], label = "val reg.")
ax2.plot(history_naive.history['acc'], label = "train naive")
ax2.plot(history_naive.history['val_acc'], label = "val naive")
finalize_standardplot(fig, ax1, ax2)

plt.show()

In [None]:
######################################################
#Plots of the Regularised model only
#######################################################
title='Learning Curve and Accuracy of Regularised (1 Hidden layer)'
history_reg=summary_reg['hidden_1']
history_naive=summary_naive['hidden_1']
fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
ax1.plot(history_reg.history['loss'][:250], label = "train reg")
ax1.plot(history_reg.history['val_loss'][:250], label = "val reg")


ax2.plot(history_reg.history['acc'][:250], label = "train reg.")
ax2.plot(history_reg.history['val_acc'][:250], label = "val reg.")
finalize_standardplot(fig, ax1, ax2)
plt.show()

title='Learning Curve and Accuracy of Regularised model (4 hidden layers)'
history_reg=summary_reg['hidden_4']
history_naive=summary_naive['hidden_4']
fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
ax1.plot(history_reg.history['loss'][:250], label = "train reg")
ax1.plot(history_reg.history['val_loss'][:250], label = "val reg")


ax2.plot(history_reg.history['acc'][:250], label = "train reg.")
ax2.plot(history_reg.history['val_acc'][:250], label = "val reg.")
finalize_standardplot(fig, ax1, ax2)

plt.show()

### The regularized model does not seem to improve, despite dropout, weight regularization and batch normalization. We think the reason is that the naive model is already under-fitting, so adding further regularization does not help much. Regularization helps mainly when a model is overfitting.
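
One rough way to check this kind of claim is to compare the final training and validation accuracies. The sketch below is only a heuristic: the function `diagnose_fit` and its thresholds `gap_tol` and `target_acc` are hypothetical choices made for illustration, and the numbers passed in are toy values, not results from this notebook.

```python
def diagnose_fit(train_acc, val_acc, gap_tol=0.05, target_acc=0.90):
    """Rough label from final accuracies; thresholds are illustrative, not tuned."""
    if train_acc - val_acc > gap_tol:
        return "overfitting"      # training clearly ahead of validation
    if train_acc < target_acc:
        return "underfitting"     # even the training data is not fit well
    return "ok"

# Toy values only
print(diagnose_fit(0.99, 0.85))   # overfitting
print(diagnose_fit(0.80, 0.79))   # underfitting
```

With a keras history object one could pass `history.history['acc'][-1]` and `history.history['val_acc'][-1]`, the keys used throughout this notebook.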

## Exercise 7: Convolutional networks

### Description

Convolutional neural networks have an inductive bias that is well adapted to image classification.

1. Design a convolutional neural network, play with the parameters and fit it. Hint: You may get valuable inspiration from the keras [examples](https://github.com/keras-team/keras/tree/master/examples), e.g. [mnist_cnn](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py).
2. Plot the learning curves of the convolutional neural network together with the so far best performing model.
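
Before fitting, it helps to trace how the spatial size shrinks through the network. This is a small sketch assuming a 16x16 input, unpadded ('valid') 3x3 convolutions and 2x2 max pooling with stride 2, mirroring the layer sequence used in the solution to this exercise; `conv_out` is a helper written for this illustration.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1

size = 16                      # assumed 16x16 input image
size = conv_out(size, 3)       # Conv2D 3x3       -> 14
size = conv_out(size, 3)       # Conv2D 3x3       -> 12
size = conv_out(size, 2, 2)    # MaxPooling2D 2x2 -> 6
size = conv_out(size, 3)       # Conv2D 3x3       -> 4
size = conv_out(size, 2, 2)    # MaxPooling2D 2x2 -> 2
flat_features = size * size * 64   # 64 filters in the last convolution
print(size, flat_features)     # 2 256
```

With so small an input, each additional unpadded convolution or pooling layer removes a large share of the remaining pixels, which limits how deep such a stack can go.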

### Solution

In [None]:
############################################################################
# Run a simple shallow model with a sequence of convolutional and pooling layers
###############################################################################
epochs = 250

img_rows, img_cols = 16, 16

x_train_ = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test_ = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

x_train_ = x_train_.astype('float32')
x_test_  = x_test_.astype('float32')

print('x_train shape:', x_train_.shape)
print(x_train_.shape[0], 'train samples')
print(x_test_.shape[0], 'test samples')

l2_decay=1e-4
l1_decay=1e-3
lr_adam =1e-3

w_reg = regularizers.l2(l2_decay)
b_reg = regularizers.l1(l1_decay)
summary_conv={}

###########################
# Simple shallow model 
#############################
def conv_model():
    Inp = Input(shape=input_shape)
    y = Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_regularizer=w_reg, bias_regularizer=b_reg)(Inp)
    y = BatchNormalization()(y)
    y = Conv2D(64, (3, 3), activation='relu', kernel_regularizer=w_reg, bias_regularizer=b_reg)(y)
    y = BatchNormalization()(y)
    y = MaxPooling2D(pool_size=(2, 2))(y)
    y = Conv2D(64, (3, 3), activation='relu', kernel_regularizer=w_reg, bias_regularizer=b_reg)(y)
    y = BatchNormalization()(y)
    y = MaxPooling2D(pool_size=(2, 2))(y)
    y = Dropout(0.25)(y)
    y = Flatten()(y)
    y = Dense(128, activation='relu',kernel_regularizer=w_reg, bias_regularizer=b_reg)(y)
    y = Dropout(0.5)(y)
    y = Dense(5, activation='softmax',kernel_regularizer=w_reg, bias_regularizer=b_reg)(y)
    model = Model(Inp,y)
    return model

model=conv_model()
model.summary()

callbacks= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5, verbose=0, mode='auto')
saveBestModel = keras.callbacks.ModelCheckpoint("./best_weights.hdf5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto',save_weights_only=True)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(lr=lr_adam, decay=1e-5), metrics=['accuracy'])
history = model.fit(x_train_, y_train, batch_size = 128,callbacks=[saveBestModel], epochs=epochs, verbose=0, validation_data=(x_test_, y_test))

summary_conv['shallow_conv_model'] =history
model.load_weights("./best_weights.hdf5")
score = model.evaluate(x_train_, y_train, verbose=0)
print("Train Loss: {} , Train Accuracy: {}".format(score[0], score[1]))
score = model.evaluate(x_test_, y_test, verbose=0)
print("Test loss: {} , Test accuracy: {}".format(score[0],score[1]))


In [None]:
###################################
#Part 2- plots of CNN vs deep models
####################################

title='Learning Curve and Accuracy of CNN vs Non-CNN'
history=summary_conv['shallow_conv_model'] 
history_old_best= summary_naive['hidden_4']
fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
ax1.plot(history.history['loss'], label = "train CNN")
ax1.plot(history.history['val_loss'], label = "val CNN")
ax1.plot(history_old_best.history['loss'][:250], label = "train FC")
ax1.plot(history_old_best.history['val_loss'][:250], label = "val FC")


ax2.plot(history.history['acc'], label = "train CNN")
ax2.plot(history.history['val_acc'], label = "val CNN")
ax2.plot(history_old_best.history['acc'][:250], label = "train FC")
ax2.plot(history_old_best.history['val_acc'][:250], label = "val FC")

finalize_standardplot(fig, ax1, ax2)
plt.show()