# Image recognition on the Fashion-MNIST dataset

## The project


The aim of this project is to use different deep learning techniques to predict the type of clothing in the Fashion-MNIST dataset as accurately as possible.

The dataset is composed of 60,000 images for training and 10,000 images for testing.

<img src="img/Fashion-MNIST-Dataset-Images-with-Labels-and-Description.png">

There are 10 different classes. Given an image, the neural network has to predict which class it belongs to.

## The code

### Linear model

We will begin with our simplest model: a linear model.

First, we import the packages we need:

```python

import tensorflow as tf
import tensorflow.keras as keras
import matplotlib.pyplot as plt

```

Then we create our linear model using Keras:

```python

def linear_model(x, y, val_x, val_y, opt, loss_func, epochs, batch_size):
    model = keras.Sequential([
        # convert a two dimensional matrix into a vector
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation=keras.activations.softmax),
    ])

    # metrics is passed as a list, the form accepted by all Keras versions
    model.compile(optimizer=opt, loss=loss_func, metrics=[keras.metrics.categorical_accuracy])

    logs = model.fit(x, y, validation_data=(val_x, val_y), epochs=epochs, batch_size=batch_size,
                     callbacks=[keras.callbacks.LearningRateScheduler(scheduler)])
    model.summary()

    return logs

```

The function takes the following parameters:

* The training and testing data
* The optimizer
* The loss function
* The number of epochs (the number of times the neural network processes the entire dataset)
* The batch size (the number of examples processed before the neural network corrects its weights)

Here we choose softmax as the activation function because its outputs sum to 1. This suits a classification problem: each output can be read as the probability the model assigns to the image belonging to that class.
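
To make the "outputs sum to 1" property concrete, here is a minimal sketch of softmax in plain Python. It is only an illustration and not the Keras implementation; the scores are made-up values for a 3-class problem.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer for 3 classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, sum(probs))
```

The largest score gets the largest probability, and the ordering of the classes is preserved.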

When we fit the model, there is an argument called callbacks. What does it do? Here it calls the scheduler function at every epoch:

```python

def scheduler(epoch, lr):
    if epoch < 150:
        return lr
    else:
        return lr * 0.9875

```

This function reduces the learning rate from the 150th epoch onward: each epoch multiplies it by 0.9875. A smaller learning rate lets the neural network take finer steps and become more and more precise from one epoch to the next.
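
To see how the learning rate evolves, the schedule can be replayed without training anything. The sketch below assumes the callback calls the function once per epoch with the current learning rate, starting at epoch 0; the simulate helper is hypothetical and only exists for this illustration.

```python
def scheduler(epoch, lr):
    # Same rule as above: constant for 150 epochs, then a 1.25% decay per epoch.
    if epoch < 150:
        return lr
    return lr * 0.9875


def simulate(initial_lr, epochs):
    """Return the learning rate used at each epoch (hypothetical helper)."""
    lr = initial_lr
    history = []
    for epoch in range(epochs):
        lr = scheduler(epoch, lr)
        history.append(lr)
    return history


rates = simulate(0.05, 300)
print(rates[149], rates[150], rates[299])
```

With an initial learning rate of 0.05 and 300 epochs, the rate stays at 0.05 for the first 150 epochs and then shrinks geometrically.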


In the main block:


```python


if __name__ == "__main__":
    # how many times the model will go through the training data
    epochs = 300
    # number of images propagated through the network (forward propagation) before the
    # errors are averaged and a single backpropagation step is made
    # a large batch size increases the available computational parallelism and converges faster
    # to a local optimum, but it is less likely to find the global minimum than a small batch size
    batch_size = 1024

    # load the training and testing data from the fashion mnist dataset
    (x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

    # pixels have values from 0 to 255, normalize them to [0, 1]
    x_train = x_train / 255.0
    x_test = x_test / 255.0

    # transform each label (a value from 0 to 9) into a vector of 10 values (one-hot encoding)
    y_train = keras.utils.to_categorical(y_train, 10)
    y_test = keras.utils.to_categorical(y_test, 10)

    all_logs = []
    # learning_rate replaces the deprecated lr argument
    log = linear_model(x_train, y_train, x_test, y_test, keras.optimizers.SGD(learning_rate=0.05, momentum=0.95),
                       keras.losses.categorical_crossentropy, epochs=epochs, batch_size=batch_size)

    all_logs.append(log)

    plot_log(all_logs)
    

```
First we introduce the hyperparameters epochs and batch_size and set them to 300 and 1024 respectively.
A large batch size lets the network process the data much faster, but there is a risk that it converges to a local (and not global) optimum.

For the loss function, cross-entropy is used because it pairs well with softmax: it strongly penalizes deviations between the predicted probabilities and the expected values.
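
As an illustration of how cross-entropy penalizes a wrong prediction, here is a small sketch in plain Python. It mirrors the idea of one-hot labels and categorical cross-entropy but is not the Keras code; the probability vectors are invented for the example and the eps clipping value is an assumption.

```python
import math


def one_hot(label, num_classes):
    """Encode an integer label as a vector with a single 1."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    # Probabilities are clipped so that log(0) can never occur.
    return -sum(
        t * math.log(min(max(p, eps), 1.0 - eps))
        for t, p in zip(y_true, y_pred)
    )


target = one_hot(2, 4)
confident_right = [0.05, 0.05, 0.85, 0.05]  # made-up prediction, correct class
confident_wrong = [0.85, 0.05, 0.05, 0.05]  # made-up prediction, wrong class

loss_right = categorical_crossentropy(target, confident_right)
loss_wrong = categorical_crossentropy(target, confident_wrong)
print(loss_right, loss_wrong)
```

Only the probability given to the true class matters, and the loss grows quickly as that probability approaches 0.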

The function plot_log displays the loss and accuracy of our models.

After 70 epochs, here are our results:

<img src="img/plot_1_1.png">

<img src="img/plot_1_2.png">

<img src="img/plot_1_3.png">

<img src="img/plot_1_4.png">

As we can see in these graphs, the loss drops sharply in the first epochs and then decreases slowly.
On the training data, the loss seems to decrease until the end, but on the testing data it seems to stabilise and even increase at the end. Let's look at the data more carefully:

<img src="img/plot_1_5.PNG">

<img src="img/plot_1_6.PNG">

<img src="img/plot_1_7.PNG">

As we can see, the loss on the testing set was at its lowest on the 38th epoch. After that, the loss on the testing set increases a little while the loss on the training set keeps decreasing over time.

This suggests that we are starting to overfit: the model over-learns the training data and can no longer generalize.


Let's try with an MLP.



### Multi Layer Perceptron


The only difference with the previous code is that we add two hidden layers to the model:

```python3

def multi_layer_perceptron(x, y, val_x, val_y, opt, loss_func, epochs, batch_size):
    model = keras.Sequential([
        # convert a two dimensional matrix into a vector
        keras.layers.Flatten(),
        keras.layers.Dense(60, activation=keras.activations.relu),
        keras.layers.Dense(60, activation=keras.activations.relu),
        keras.layers.Dense(10, activation=keras.activations.softmax),
    ])

    # metrics is passed as a list, the form accepted by all Keras versions
    model.compile(optimizer=opt, loss=loss_func, metrics=[keras.metrics.categorical_accuracy])

    logs = model.fit(x, y, validation_data=(val_x, val_y), epochs=epochs, batch_size=batch_size,
                     callbacks=[keras.callbacks.LearningRateScheduler(scheduler)])
    model.summary()

    return logs

```

I reduced the number of epochs to 50, as it is useless to train longer if the model overfits before the end.

I also changed the scheduler function:

```python3

def scheduler(epoch, lr):
    if epoch < 30:
        return lr
    else:
        return lr * 0.98

```

Let's look at the results:


<img src="img/plot_2_1.png">

<img src="img/plot_2_2.png">

<img src="img/plot_2_3.png">

<img src="img/plot_2_4.png">


As we can see in the plots, the MLP performs much better than the previous model.
For this example I used the relu activation function. What would happen with other activation functions? Which one is best here? Let's see:


<img src="img/plot_3_1.png">

<img src="img/plot_3_2.png">

<img src="img/plot_3_3.png">

<img src="img/plot_3_4.png">


The results are very interesting: the elu, relu, selu and tanh activation functions behave similarly. The sigmoid function, on the other hand, works poorly in the early epochs but starts to surpass the others at the end.


By analysing the error on the training and testing sets, we can see that the model is overfitting the data, except with sigmoid.
There are 3 ways to fight overfitting:
* reduce the model complexity
* add more data
* add regularization techniques

Adding more data is not possible, and between the two remaining choices I chose to add regularization techniques.
Let's see what happens if I add dropout.

Dropout is a technique that randomly selects neurons to be ignored during training. Dropout is used because it reduces the variance of the model, which limits overfitting.
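
The mechanism can be sketched in a few lines of plain Python. This is a simplified illustration of the common "inverted dropout" variant, where the kept activations are rescaled so that their expected value is unchanged. It is an assumption that this matches the behaviour of the Dropout layer used here in every detail.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        # At inference time every neuron is kept.
        return list(activations)
    keep = 1.0 - rate
    # Kept values are divided by `keep` to preserve the expected sum.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


layer_output = [1.0] * 10  # made-up activations
print(dropout(layer_output, rate=0.2, rng=random.Random(0)))
print(dropout(layer_output, rate=0.2, training=False))
```

Because a different random subset of neurons is silenced at each step, the network cannot rely on any single neuron.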

I have also added a new callback named EarlyStopping. This callback has a few advantages:

* it prevents the model from overfitting by stopping training when it stops improving
* it saves computation time by stopping before the last epoch
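
The stopping rule can be sketched as follows. This is a simplified model of the patience logic written for this README, not the Keras source; the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 2.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=10))
```

A small patience stops early but can miss a late improvement (the 0.5 at the end of the list), which is the trade-off to keep in mind when choosing its value.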

Let's see the results:

<img src="img/plot_4_1.png">

<img src="img/plot_4_2.png">

<img src="img/plot_4_3.png">

<img src="img/plot_4_4.png">

As we can see, our models begin to overfit much later than in the previous cases, but the performance has not increased.
Let's try to change the architecture of the MLP and go from 2 to 3 or more hidden layers.

With the architecture:
* 28 * 28 neurons as inputs
* first hidden layer: 128 neurons
* second hidden layer: 64 neurons
* third hidden layer: 32 neurons
* 10 neurons as outputs
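
To get a feel for the size of this architecture, the number of trainable parameters of a stack of fully connected layers can be computed by hand: each layer has one weight per input-output pair plus one bias per output. The helper below is a hypothetical illustration and ignores dropout, which adds no parameters.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of consecutive fully connected layers."""
    return sum(
        (n_in + 1) * n_out  # n_in weights + 1 bias for each output neuron
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


mlp = dense_parameter_count([28 * 28, 128, 64, 32, 10])
linear = dense_parameter_count([28 * 28, 10])
print(mlp, linear)
```

Most of the parameters sit in the first hidden layer, because it is connected to all 784 input pixels.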

I obtained the following results:

<img src="img/plot_5_1.png">

<img src="img/plot_5_2.png">

<img src="img/plot_5_3.png">

<img src="img/plot_5_4.png">

As we can see, there are not many changes. Finally, let's try the Adam optimizer. Adam uses adaptive learning rates, which lets training converge faster towards a local minimum:

<img src="img/plot_6_1.png">

<img src="img/plot_6_2.png">

<img src="img/plot_6_3.png">

<img src="img/plot_6_4.png">

So, we converge faster but obtain slightly worse results.

I tried changing parameters such as lowering the learning rate, but then we lose the advantage of the faster convergence.
Adding a fourth layer doesn't help much.

Looking back at our previous results with stochastic gradient descent, sigmoid and selu converge much more slowly than the others for pretty much the same results on the testing data, so we remove them.

Now let's run various tests and look at the results on the testing set:

<img src="img/plot_7_1.png">

<img src="img/plot_7_2.png">

<img src="img/plot_7_3.png">

<img src="img/plot_7_4.png">

The three functions seem pretty much equivalent (relu wins in the first graph, elu in the second, the third is a tie and tanh wins the fourth), but relu converges faster, so relu seems better in our case.

Running the same algorithm many times is important because the patterns can differ from one run to the next. For example, I ran a few more tests and obtained different results:

<img src="img/plot_8_1.png">

<img src="img/plot_8_2.png">

<img src="img/plot_8_3.png">

As we can see, elu was the fastest to converge in the first graph and the slowest in the second and third. This is due to the random weight initialisation.

After more testing, varying the number and size of the hidden layers and changing the dropout, I found one of the best architectures for the model:

* 2 hidden layers of 120 neurons
* relu activation function (as it gives the best results along with tanh and is the one that converges the fastest)
* dropout of 20%

Let's compare it with our previous model, the linear model (a perceptron without hidden layers):

<img src="img/plot_9_1.png">

<img src="img/plot_9_2.png">

<img src="img/plot_9_3.png">

<img src="img/plot_9_4.png">

As we can see, our MLP performs much better (89.46% at its peak) than the linear model (84.62%).


### Convolutional neural network


Now let's try a last, different approach: instead of feeding the image directly into the MLP, we first apply some transformations to it. These are the convolution part and the pooling part.

<img src="img/convolution_illustration.png">

As we can see, the image goes through a succession of convolution and pooling layers, then the data is processed by a fully connected layer, and the output is a vector with the probability of each class (softmax function).

So first, what is convolution?

Convolution is used to extract features from the images.

Let's look at edge filters as an example:

<img src="img/filter_illustration.jpg">

As we can see, we apply 2 filters to the 2 images: one for vertical edge detection and another for horizontal edge detection.
As output we obtain the same image, with the detected edges (horizontal or vertical) shown in white.
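
The edge filters above can be reproduced on a tiny image with a few lines of plain Python. This is a minimal sketch, not the code from "Convolution.py": it slides the kernel over the image without padding, and the two 3 * 3 kernels are classic edge detectors chosen for the illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Tiny image: dark on the left, bright on the right (one vertical edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

print(convolve2d(image, vertical_edge))
print(convolve2d(image, horizontal_edge))
```

The vertical filter responds only around the column where the brightness changes, while the horizontal filter stays at zero because this image has no horizontal edge.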

After extracting the features, we pass the resulting images to a pooling part.

The job of a pooling layer is to reduce the number of pixels of the image in order to decrease the computational power required to process the data.

There are 2 types of pooling layer:

* Mean pooling
* Max pooling

<img src="img/pooling_illustration.png">

In our example, given a matrix of 8 * 8 pixels and a step (stride) of 2, we obtain a matrix of 4 * 4 pixels.
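
The 8 * 8 to 4 * 4 reduction can be checked with a short sketch in plain Python. It is not the code from "Pooling.py": the 2 * 2 window size is an assumption, and the image is just a grid of increasing numbers.

```python
def pool2d(image, size=2, stride=2, reduce=max):
    """Apply `reduce` (max by default) to each window of the image."""
    pooled = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            window = [
                image[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# 8 * 8 grid whose pixel values are 0, 1, 2, ... 63.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

max_pooled = pool2d(image)
mean_pooled = pool2d(image, reduce=mean)
print(len(max_pooled), len(max_pooled[0]))
```

Max pooling keeps the strongest response of each window, whereas mean pooling averages it.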


Let's see the effect of max pooling on real images.

<img src="img/before_polling.jpg">

<img src="img/after_pooling.jpg">

As we can see, we reduce the size of the image without losing much information.

These images are from the folder named "without_tensor_flow".
If you want to see more about the algorithms used in the convolution and pooling layers, I recommend checking the Python files "Convolution.py" and "Pooling.py".

After that, the resulting images are fed into a fully connected layer (MLP).

Let's implement that in TensorFlow:

```python

def convolutional_neural_network(x, y, val_x, val_y, opt, loss_func, epochs, batch_size, activation, dropout):
    model = keras.Sequential([
        keras.layers.Reshape((28, 28, 1)),

        keras.layers.Conv2D(32, (3, 3), padding="same", activation=activation),
        keras.layers.MaxPool2D(),
        keras.layers.Dropout(dropout),

        keras.layers.Conv2D(32, (3, 3), padding="same", activation=activation),
        keras.layers.MaxPool2D(),
        keras.layers.Dropout(dropout),

        keras.layers.Conv2D(32, (3, 3), padding="same", activation=activation),
        keras.layers.MaxPool2D(),
        keras.layers.Dropout(dropout),

        keras.layers.Flatten(),

        keras.layers.Dense(10, activation=keras.activations.softmax)
    ])

    # metrics is passed as a list, the form accepted by all Keras versions
    model.compile(optimizer=opt, loss=loss_func, metrics=[keras.metrics.categorical_accuracy])

    logs = model.fit(x, y, validation_data=(val_x, val_y), epochs=epochs, batch_size=batch_size,
                     callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)])

    model.summary()

    return logs


```

I first tried with a linear model at the end, with no hidden layers. Let's see the results:

<img src="img/plot_10_1.png">

<img src="img/plot_10_2.png">

<img src="img/plot_10_3.png">

<img src="img/plot_10_4.png">

As we can see, except for sigmoid, the results are pretty similar and better than those obtained with an MLP.
We can also see that there is no overfitting, so let's make the model a bit more complex by adding a hidden layer.

Let's now see the results with a hidden layer of 60 neurons. I keep sigmoid in case a miracle happens and it works amazingly well with the hidden layer.

<img src="img/plot_11_1.png">

<img src="img/plot_11_2.png">

<img src="img/plot_11_3.png">

<img src="img/plot_11_4.png">

The model runs better without the hidden layer, so let's try with only 30 neurons in the hidden layer.

<img src="img/plot_12_1.png">

<img src="img/plot_12_2.png">

<img src="img/plot_12_3.png">

<img src="img/plot_12_4.png">

Here the results are better than ever. Let's see what happens if we reduce the dropout from 20% to 10%:

<img src="img/plot_13_1.png">

<img src="img/plot_13_2.png">

<img src="img/plot_13_3.png">

<img src="img/plot_13_4.png">

With 10% dropout, we obtain the best results, with a peak of 92.3% on the testing set for relu at the 79th epoch.
As tanh and relu give the best performance, we will train our models only with them, which allows us to train for more epochs. Let's run both relu and tanh for 150 epochs and see what happens:

<img src="img/plot_14_1.png">

<img src="img/plot_14_2.png">

<img src="img/plot_14_3.png">

<img src="img/plot_14_4.png">

Relu seems to give better results than tanh. Let's see which dropout rate is best for relu:


<img src="img/plot_15_1.png">

<img src="img/plot_15_2.png">

<img src="img/plot_15_3.png">

<img src="img/plot_15_4.png">

The results for the three values are pretty similar, so let's run a second test to decide which dropout value is best.

<img src="img/plot_15_5.png">

I decided to keep a value of 10% for the dropout, as it gives the best results in both tests.



We previously had a batch size of 1024. This value may be too high, and a lower one could give better results.

Let's run a test with batch sizes of 128, 256 and 512:


<img src="img/plot_16_1.png">

<img src="img/plot_16_2.png">

<img src="img/plot_16_3.png">

<img src="img/plot_16_4.png">

As we can see, a batch size of 256 gives the best results. We also see that larger batch sizes need more epochs to converge before EarlyStopping stops the training. But even with more epochs needed, they are still faster thanks to GPU parallelism.
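
One way to understand why larger batches need more epochs is to count the weight updates made in one epoch. The sketch below assumes one update per batch over the 60,000 training images, with a last partial batch when the size does not divide evenly; the helper name is hypothetical.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)


for batch_size in (128, 256, 512, 1024):
    print(batch_size, updates_per_epoch(60000, batch_size))
```

With a batch size of 1024 the network is updated far fewer times per epoch than with 128, so it needs more epochs to make the same number of corrections, even though each epoch runs faster.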

So, as we can see, our best results give an accuracy around 0.925.

Now we will use a different technique called data augmentation. The principle is to create variants of the training images.

There are 3 ways to reduce overfitting:
- reduce the model complexity, but in our case it would not increase our accuracy
- add regularization techniques, we already use dropout
- increase the amount of data, which is what data augmentation does

As we will drastically increase the amount of data, regularization techniques such as dropout are no longer required.
Moreover, we will use another technique that does not work well when coupled with dropout: batch normalization.
Batch normalization, as explained at the start of the notebook, is used to speed up training. The difference is that input normalization is applied to the training data, whereas batch normalization is applied to the outputs of the convolution layers.

Batch normalization also has a slight regularization effect: it adds noise to the data because the statistics are computed on each batch and not on the whole dataset. It can also reduce internal covariate shift problems.

We use batch normalization after each convolutional layer.
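
The noise mentioned above comes from the fact that the same value is normalized differently depending on the batch it falls in. Here is a minimal sketch of the training-time computation for a single feature in plain Python. It leaves out the moving averages used at inference time, and the gamma, beta and eps defaults are assumptions for the illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature using the statistics of the current batch."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # gamma and beta are the learned scale and shift (fixed here).
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# The value 1.0 appears in two made-up batches with different statistics.
out_a = batch_norm([1.0, 2.0, 3.0, 4.0])
out_b = batch_norm([1.0, 10.0, 20.0, 30.0])
print(out_a[0], out_b[0])
```

Each normalized batch has a mean close to 0 and a variance close to 1, but the value 1.0 is mapped to a different output in each batch.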

Data augmentation consists of producing new images from an existing image of the training dataset.
The image can be modified in several ways:
- A symmetry is applied to the image (horizontal or vertical, vertical in our case)
- The image is zoomed
- The image is randomly rotated (not useful in our case because the images are always centered)
- ...

This article explains the various data augmentation parameters very well: https://towardsdatascience.com/exploring-image-data-augmentation-with-keras-and-tensorflow-a8162d89b844

```python


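# dataGen_training and dataGen_testing are the ImageDataGenerator objects defined further down
train_generator = dataGen_training.flow(x_train, y_train, batch_size=batch_size)

# validation images: the first 150 batches of the training set
x_valid = x_train[:150 * batch_size]
y_valid = y_train[:150 * batch_size]

valid_steps = x_valid.shape[0] // batch_size
validation_generator = dataGen_testing.flow(x_valid, y_valid, batch_size=batch_size)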


```

**Edit**: When I created my conv net, I thought that 150 was the number of image variants created for each original image. I later understood that the training images themselves are variants of the original images: 60,000 images with data augmentation result in 60,000 images, not 60,000 * 150. So here we have 150 * 256 = 38,400 images for validation. That is far too many and slows down the training; 50 * batch size would have been better.

Here is the architecture I used:

```python


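# layer classes used below without the keras.layers prefix
from tensorflow.keras.layers import BatchNormalization, Conv2D, MaxPool2D, Reshape

model = keras.Sequential([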
        Reshape((28, 28, 1)),
        BatchNormalization(),

        Conv2D(196, (3, 3), padding="same", activation=activation),
        BatchNormalization(),
        Conv2D(196, (3, 3), padding="same", activation=activation),
        BatchNormalization(),
        MaxPool2D(),

        Conv2D(92, (3, 3), padding="same", activation=activation),
        BatchNormalization(),
        Conv2D(92, (3, 3), padding="same", activation=activation),
        BatchNormalization(),
        MaxPool2D(),

        Conv2D(48, (3, 3), padding="same", activation=activation),
        BatchNormalization(),
        Conv2D(48, (3, 3), padding="same", activation=activation),
        BatchNormalization(),

        keras.layers.Flatten(),

        keras.layers.Dense(30, activation=activation),

        BatchNormalization(),

        keras.layers.Dense(10, activation=keras.activations.softmax)
    ])


```

For an unknown reason, 2 convolution layers of half the size work better than 1 convolution layer.

```python


# learning_rate replaces the deprecated lr argument; metrics is passed as a list
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.02), loss=keras.losses.categorical_crossentropy,
              metrics=[keras.metrics.categorical_accuracy])

logs = model.fit(
    train_generator,
    steps_per_epoch=len(x_train) // batch_size,
    epochs=200,
    callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=20),
               keras.callbacks.LearningRateScheduler(scheduler)],
    validation_data=validation_generator,
    validation_freq=1,
    validation_steps=valid_steps,
    verbose=2,
)


```

I used Adam: it is a good optimizer that needs little tuning, gives good results and converges very fast.

```python


dataGen_training = ImageDataGenerator(
  rotation_range=10,
  horizontal_flip=False,
  vertical_flip=True,
  width_shift_range=0.1,
  height_shift_range=0.1,
  rescale=1. / 255,
  shear_range=0.05,
  zoom_range=0.05,
)

dataGen_testing = ImageDataGenerator(
  rescale=1. / 255,
)


```

Here is the code for data augmentation. I made some **huge mistakes**.
When I wrote the code, I thought vertical_flip meant a symmetry about the vertical axis (a left-right mirror), whereas it actually flips the image upside down. Moreover, I copied the other lines from my study on the CIFAR10 dataset, which you can find here (in French): https://github.com/eldoria/cifar10/blob/master/Recognition%20of%20image%20on%20cifar-10.ipynb
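
The difference between the two flips is easy to check on a tiny matrix. The functions below are plain Python stand-ins written for this README, not the generator's own code; they only illustrate which axis each option mirrors.

```python
def horizontal_flip(image):
    """Mirror left-right: each row is reversed."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror top-bottom: the order of the rows is reversed."""
    return image[::-1]


image = [
    [1, 2, 3],
    [4, 5, 6],
]

print(horizontal_flip(image))
print(vertical_flip(image))
```

For clothes, a top-bottom flip produces upside-down garments that never appear in the testing set, which is why this setting was a mistake.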

As all the images are centred, the shift ranges, shear_range and zoom_range are useless and even counterproductive.

The testing data is not modified: we must never alter the testing dataset, otherwise we cannot evaluate our model properly.



Even with this bad configuration I got fantastic results:


<img src="img/plot_17_1.png">

<img src="img/plot_17_2.png">

<img src="img/plot_17_3.png">

<img src="img/plot_17_4.png">

The peak accuracy is 0.995 on the testing set and 0.992 on the training set.
If I had let the algorithm run longer, it would certainly have reached a near-perfect score on the training set. The testing accuracy is higher than the training accuracy because the training images are harder to predict, as they are modified.

We can see that the Adam optimizer converges VERY fast, but the problem is that after a certain time the testing set stops improving and, for some obscure reason, EarlyStopping never stops the algorithm.

I made a new Python file in order to visualize the effects of data augmentation on my images.
Let's see the effects of data augmentation with my previous bad configuration:

<img src="img/data_augmentation_1.png">

As we can see, the flip error makes the data needlessly more complex, since every example in the testing set will be the right way up.
Let's change the rotation and see the results:

<img src="img/data_augmentation_2.png">

Here it gives pretty good results, as we get the same image with small differences each time.
But there is a small problem: it works well for this example, but if we test one more time:

<img src="img/data_augmentation_3.png">

We can see in this example that shoes get mirrored. This is not very relevant, as all the shoes point toward the left in our testing dataset.

I also reduced the random rotation from 10 to 5 degrees and removed the zoom:

<img src="img/data_augmentation_4.png">

This is better, but width_shift_range makes the image lose information because part of it goes out of the frame. It is also useless, as CNNs are largely invariant to translation, so let's remove these parameters.

I increased shear_range, which is useful to slightly deform the image:

<img src="img/data_augmentation_5.png">
<img src="img/data_augmentation_6.png">

We can clearly see the effect of the deformation on these 2 examples. So the changes in shape are good, but one parameter is still missing: the color.

I first added a parameter, channel_shift_range, in order to change the intensity of the images:

<img src="img/data_augmentation_7.png">

Some other examples:

<img src="img/data_augmentation_8.png">
<img src="img/data_augmentation_9.png">

After these successful tests I was about to launch a training run, but I ran the visualization script instead of the conv script and discovered a horrific scene:

<img src="img/data_augmentation_10.png">

As we can see, the garment is black, and by playing with the color intensity it becomes practically invisible in certain cases.

Various possibilities:
* Accept that a fraction of the data will become bad and hope that the neural network will still generalize to the testing data (which is not modified by data augmentation)

* Decrease the effect of channel_shift_range, but we would lose the benefit of the modifier, as light images would stay quite light and dark images would stay quite dark

* Find another data augmentation technique that also affects the coloration of images but without degrading dark images

I first tried another technique: brightness_range.
Using it is a little odd: brightness_range takes 2 floats as a range, and the value applied is a random number between these two floats.

In my tests:

* a value below 0 darkens the image
* a value between 0 and 1 seems to do nothing
* a value above 1 lightens the image


- The tests for a range between -1 and 0:

Original image :

<img src="img/shift_ex_1.png">

Results :

<img src="img/shift_ex_2.png">



- The tests for a range between 0 and 1:

Original image :

<img src="img/shift_ex_3.png">

Results :

<img src="img/shift_ex_4.png">

Note that one image is missing here, because one of the images has a value practically equal to 0.



- The tests for a range between 1 and 2:

Original image :

<img src="img/shift_ex_5.png">

Results :

<img src="img/shift_ex_6.png">

Here we see that the images are lightened pretty well.