# Task Report

## Introduction

The goal of this task is to evaluate whether using an autoencoder as the input stage of a classifier improves the classifier's performance. The hypothesis is that, because an autoencoder is trained to learn the features that enable it to reproduce its input, these features should be the most salient ones and should improve a classifier that takes them as input.

The dataset used for this task was CIFAR-10. A further constraint was that only half of the training samples could be used for the following classes:
* bird
* deer
* truck

The percentage of training samples used for the other classes was unconstrained.
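The class constraint can be sketched as an index selection over the training labels. This is a minimal illustration, not the code used for the report: the helper `select_training_indices`, the `seed` argument, and the choice of a random half (rather than, say, the first half) are all assumptions.

```python
import numpy as np

# CIFAR-10 class names in their standard label order.
CLASS_NAMES = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
REDUCED_CLASSES = ["bird", "deer", "truck"]


def select_training_indices(labels, keep_fraction=0.5, seed=0):
    """Hypothetical helper: keep `keep_fraction` of each reduced class
    and every sample of the remaining classes."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    reduced_ids = {CLASS_NAMES.index(name) for name in REDUCED_CLASSES}
    kept = []
    for class_id in range(len(CLASS_NAMES)):
        idx = np.flatnonzero(labels == class_id)
        if class_id in reduced_ids:
            # Assumption: the retained half is chosen at random.
            n_keep = int(len(idx) * keep_fraction)
            idx = rng.choice(idx, size=n_keep, replace=False)
        kept.append(idx)
    return np.sort(np.concatenate(kept))


# Toy labels: 100 samples per class, standing in for the real training labels.
toy_labels = np.repeat(np.arange(10), 100)
selected = select_training_indices(toy_labels)
```

The returned indices could then be used to build a subset of the training set, for example with `torch.utils.data.Subset`.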

## System Information
The training was performed on a Google Cloud Platform instance, using the PyTorch library as the deep learning framework.
```
GPU: Tesla P4 (8GB RAM)
CUDA Version 10.0
Python Version: 3.7.3
Pytorch Version: 1.1.0
```

## Training strategy

It was left open whether to train the autoencoder and classifier separately or together.
I decided to train them separately, because training them together would require the overall loss to be a linear combination of the autoencoder loss and the classifier loss, and the relative weighting of these losses would be an additional hyperparameter. I felt that this would complicate the task further, so I trained the autoencoder first and then used its encoder layers in the classifier.
When training the classifier, the layers provided by the autoencoder were frozen, so that only the layers unique to the classifier were trained.
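Freezing the pretrained layers can be sketched as follows. The `encoder` and `head` modules here are small hypothetical stand-ins rather than the report's actual layers, and the Adam optimizer with its learning rate is an assumption, since the report does not name the optimizer.

```python
import torch
from torch import nn


def freeze(module: nn.Module) -> None:
    """Disable gradient updates for every parameter in `module`."""
    for param in module.parameters():
        param.requires_grad = False


# Hypothetical stand-ins for the pretrained encoder and the new classifier head.
encoder = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU())
head = nn.Linear(32, 10)

freeze(encoder)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # optimizer choice is an assumption
```

Note that setting `requires_grad = False` does not stop Batch Normalization layers from updating their running statistics while the module is in training mode. Whether the frozen encoder was also switched to evaluation mode is not stated here.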

I used as much training data as possible, i.e. 100% of the training data for the classes that were not restricted to half. The test data was split in two: half of the samples were used for validation and the other half for testing.
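A minimal sketch of such a split is shown below. The helper name `split_validation_test` is hypothetical, and a random, non-stratified split is an assumption; the report does not say how the two halves were chosen.

```python
import numpy as np


def split_validation_test(num_samples, seed=0):
    """Shuffle indices 0..num_samples-1 and split them into two halves."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(num_samples)
    half = num_samples // 2
    return permuted[:half], permuted[half:]


# The CIFAR-10 test set contains 10,000 images.
val_idx, test_idx = split_validation_test(10_000)
```

With a non-stratified split like this one, the number of samples per class can differ slightly between the validation and test halves.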

Rather than decide on a number of epochs for training, the training continued until the validation metric did not improve for 5 consecutive epochs. The parameters at the epoch that gave the best validation metric were restored into the model.
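The stopping rule can be sketched as a small helper class. This is an illustrative implementation under the assumption that the validation metric is a loss (lower is better); the class name `EarlyStopping` and its attributes are hypothetical, not taken from the project code.

```python
import copy


class EarlyStopping:
    """Track the best validation metric and signal when to stop training."""

    def __init__(self, patience=5, lower_is_better=True):
        self.patience = patience
        self.lower_is_better = lower_is_better
        self.best_metric = None
        self.best_state = None
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, epoch, metric, state):
        """Record one epoch; return True when training should stop."""
        if self.best_metric is None:
            improved = True
        elif self.lower_is_better:
            improved = metric < self.best_metric
        else:
            improved = metric > self.best_metric
        if improved:
            self.best_metric = metric
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)  # snapshot to restore later
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Toy validation losses; with a real model, `state` would be model.state_dict().
stopper = EarlyStopping(patience=5)
val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63, 0.64, 0.5]
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss, state={"epoch": epoch}):
        break
```

In this toy run, training stops after five epochs without improvement, and `stopper.best_state` holds the snapshot from the best epoch, which would then be loaded back into the model.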

Due to time and resource constraints, the emphasis was not on approaching state-of-the-art classification results, although some effort was made to get a reasonably performant system. Rather, the effort was concentrated on seeing what difference, if any, the use of the autoencoder made.

## Autoencoder architecture

The final architecture of the autoencoder was decided upon after a few experiments. I added the self-imposed constraint that the number of features at the hidden layer of the autoencoder should be less than the total number of input features. Although autoencoders with more hidden features than input features do exist, and are regularized by denoising and other methods, in my opinion this defeats the purpose of an autoencoder, which is to find a smaller set of features that adequately summarizes the input. Constraining the number of hidden features to be less than the number of input features also acts as a regularizer, and with this constraint no overfitting was observed.

In my experiments, various architectures were tried, but in the end the one that gave the best test loss was a relatively simple one.

* CNN Layer 3 -> 32 (with ReLU and Batch Normalization)
* CNN Layer 32 -> 32 (with ReLU and Batch Normalization)
* Max Pooling 2x2
* CNN Layer 32 -> 32 (with ReLU and Batch Normalization)
* Max Pooling 2x2

* Dense Layer 2048 -> 2048 (with ReLU and Batch Normalization)

This gave 2048 features at the hidden layer.
The decoder portion was the same in reverse order, with Max Pooling replaced by Upscaling. However, the final layer of the decoder used a Sigmoid activation rather than a ReLU activation, so that the output features would more easily map to the desired $[0, 1]$ interval.

Mean squared error was used as the loss function for the autoencoder.
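The architecture described above can be sketched in PyTorch as follows. Several details are assumptions because they are not stated in this report: the 3x3 kernels with padding 1, the ReLU-then-BatchNorm ordering, nearest-neighbour upsampling, and the absence of Batch Normalization on the final decoder layer. The class and method names are hypothetical.

```python
import torch
from torch import nn


def conv_block(in_ch, out_ch):
    # Assumption: 3x3 kernel with padding 1, which preserves the spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(out_ch),
    )


class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder_conv = nn.Sequential(
            conv_block(3, 32),
            conv_block(32, 32),
            nn.MaxPool2d(2),              # 32x32 -> 16x16
            conv_block(32, 32),
            nn.MaxPool2d(2),              # 16x16 -> 8x8, i.e. 32 * 8 * 8 = 2048
        )
        self.encoder_dense = nn.Sequential(
            nn.Linear(2048, 2048), nn.ReLU(), nn.BatchNorm1d(2048))
        self.decoder_dense = nn.Sequential(
            nn.Linear(2048, 2048), nn.ReLU(), nn.BatchNorm1d(2048))
        self.decoder_conv = nn.Sequential(
            nn.Upsample(scale_factor=2),  # 8x8 -> 16x16
            conv_block(32, 32),
            nn.Upsample(scale_factor=2),  # 16x16 -> 32x32
            conv_block(32, 32),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
            nn.Sigmoid(),                 # outputs lie in [0, 1]
        )

    def encode(self, x):
        features = self.encoder_conv(x)
        return self.encoder_dense(features.view(features.size(0), -1))

    def decode(self, code):
        features = self.decoder_dense(code).view(-1, 32, 8, 8)
        return self.decoder_conv(features)

    def forward(self, x):
        return self.decode(self.encode(x))


model = Autoencoder()
batch = torch.rand(4, 3, 32, 32)          # random stand-in for CIFAR-10 images
reconstruction = model(batch)
loss = nn.MSELoss()(reconstruction, batch)
```

The 2048 hidden features are fewer than the 3 * 32 * 32 = 3072 input features, which matches the self-imposed bottleneck constraint.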

## Classifier Structure

The classifier is initialised with the encoder layers of the autoencoder. On top of these, further dense layers are added:

* Dense 2048 -> 256 (with ReLU and Batch Normalization) Dropout 0.5
* Dense 256 -> 10 (with ReLU and Batch Normalization)

Cross-entropy loss was used as the loss function for the classifier.
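The dense layers listed above could be written roughly as below. The layer order inside each block (Linear, ReLU, BatchNorm, Dropout) is an assumption, and the random `features` tensor is a stand-in for the 2048-dimensional output of the frozen encoder.

```python
import torch
from torch import nn

# Classifier head following the two dense layers listed in the report.
classifier_head = nn.Sequential(
    nn.Linear(2048, 256), nn.ReLU(), nn.BatchNorm1d(256), nn.Dropout(0.5),
    # The report lists ReLU and Batch Normalization on the output layer too.
    nn.Linear(256, 10), nn.ReLU(), nn.BatchNorm1d(10),
)

features = torch.rand(8, 2048)             # stand-in for frozen encoder outputs
targets = torch.randint(0, 10, (8,))       # random class labels for illustration
logits = classifier_head(features)
loss = nn.CrossEntropyLoss()(logits, targets)
```

`nn.CrossEntropyLoss` applies the softmax internally, so the head outputs unnormalized scores rather than probabilities.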

## Autoencoder Training

When training the autoencoder, Gaussian noise with zero mean and a standard deviation of 0.3 was added to the image as a further regularization technique. Images were also horizontally flipped with a probability of 0.5. In some experiments dropout was added, but this led to a deterioration in performance, so no dropout was used in the final architecture.
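The augmentation step can be sketched as below. It assumes the usual denoising setup, in which the noisy image is the input and the clean (flipped) image is the reconstruction target; the report does not state this explicitly. The function name `augment` is hypothetical, and the noisy values are left unclipped.

```python
import torch


def augment(images, noise_std=0.3, flip_prob=0.5):
    """Randomly flip each image horizontally, then return (noisy, clean) batches."""
    flip_mask = torch.rand(images.size(0)) < flip_prob
    clean = images.clone()
    clean[flip_mask] = clean[flip_mask].flip(3)  # dim 3 is the width axis (N, C, H, W)
    # Zero-mean Gaussian noise; values may leave [0, 1] because nothing is clipped.
    noisy = clean + noise_std * torch.randn_like(clean)
    return noisy, clean


images = torch.rand(16, 3, 32, 32)               # random stand-in for a CIFAR-10 batch
noisy, clean = augment(images)
```

In a training loop under this assumption, the loss would be computed between the model output for `noisy` and the `clean` batch.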

![title](./autoencoder_losses.png)

I next examined a sample of test images before and after the autoencoder.

#### Input images sample
![title](./input_grid.png)

#### Reproduction
![title](./output_grid.png)

The reproduced images have low contrast and are heavily smoothed relative to the originals. They are not particularly pleasing to the eye.
The output images from an autoencoder without the dense layer at the end of the encoding stage are shown below.

![title](./output_grid_no_linear.png)

These images have less extreme smoothing and better contrast. However, despite their better appearance, using this autoencoder gave worse classifier results.
A possible explanation is that going to a 1-D layer loses positional information that is useful for reconstructing the image, while the representation gained is more useful for the classification task. However, more investigation is needed to confirm this.

I also examined the loss per class:


| Class | Loss |
| --- | --- |
| airplane | 0.10 |
| automobile | 0.19 |
| bird | 0.13 |
| cat | 0.17 |
| deer | 0.15 |
| dog | 0.16 |
| frog | 0.18 |
| horse | 0.16 |
| ship | 0.12 |
| truck | 0.16 |
| **Mean** | **0.15** |

The classes that were underrepresented in training (bird, deer and truck) did not generally show a worse loss than the others, so I felt there was no need to compensate for the class imbalance when training the autoencoder.
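A per-class breakdown like the table above can be computed by averaging a per-sample loss over each class. The sketch below assumes the per-sample loss is the mean squared error over all pixels of an image; the exact reduction used for the reported numbers is not stated, and the function name `per_class_mse` is hypothetical.

```python
import numpy as np


def per_class_mse(images, reconstructions, labels, num_classes=10):
    """Average the per-sample mean squared error separately for each class."""
    images = np.asarray(images, dtype=np.float64)
    reconstructions = np.asarray(reconstructions, dtype=np.float64)
    labels = np.asarray(labels)
    # One loss value per image: mean over channels, height and width.
    per_sample = ((images - reconstructions) ** 2).reshape(len(images), -1).mean(axis=1)
    return np.array([per_sample[labels == c].mean() for c in range(num_classes)])


# Toy data: two classes, two tiny "images" each.
toy_images = np.zeros((4, 3, 2, 2))
toy_recon = np.concatenate([np.full((2, 3, 2, 2), 0.5), np.ones((2, 3, 2, 2))])
toy_labels = np.array([0, 0, 1, 1])
class_losses = per_class_mse(toy_images, toy_recon, toy_labels, num_classes=2)
```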


## Classifier Training

To compensate for the imbalance between training classes, weights were used when calculating the cross-entropy loss, so that the underrepresented classes would contribute more strongly than if no weighting were used.

```
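import numpy as np
import torch

def get_cross_entropy_weights():
    # class_labels and reduced_train_labels are assumed to be defined at module level.
    weights = np.array([2.0 if c in reduced_train_labels else 1.0 for c in class_labels])
    return torch.Tensor(weights / weights.sum())  # Normalize so the weights sum to 1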
```
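As a worked example of the weighting above: with three reduced classes at weight 2 and seven others at weight 1, the weights sum to 13, so each reduced class gets 2/13 and each other class gets 1/13. The sketch below shows one way such weights could be passed to PyTorch's loss. The label lists are written out here for self-containment, and the use of the `weight` argument of `nn.CrossEntropyLoss` is an assumption about how the original code consumed them.

```python
import numpy as np
import torch
from torch import nn

class_labels = ["airplane", "automobile", "bird", "cat", "deer",
                "dog", "frog", "horse", "ship", "truck"]
reduced_train_labels = ["bird", "deer", "truck"]

raw = np.array([2.0 if c in reduced_train_labels else 1.0 for c in class_labels])
weights = torch.tensor(raw / raw.sum(), dtype=torch.float32)  # 2/13 or 1/13 each

# Assumption: the weights are supplied through the `weight` argument.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(6, 10)                    # random scores for illustration
targets = torch.tensor([0, 2, 4, 9, 5, 2])
loss = criterion(logits, targets)
```

With the default mean reduction, PyTorch divides the weighted sum by the total weight of the targets in the batch, so only the ratio between the class weights affects the loss value.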

The losses from training the final classifier are as follows:

![title](./auto_class_losses.png)

The best validation score was achieved at epoch 20, so the parameters at that stage were used for testing.

The test accuracies per class were:

| Class | Test Accuracy |
| --- | --- |
| airplane | 86.75% |
| automobile | 88.19% |
| bird | 64.92% |
| cat | 65.63% |
| deer | 68.13% |
| dog | 66.40% |
| frog | 86.29% |
| horse | 87.43% |
| ship | 89.88% |
| truck | 86.07% |
| **Overall** | **78.84%** |

While the accuracies for the bird and deer classes are both below the overall accuracy, the truck class accuracy is above it, so it appears that adding weights to the cross-entropy loss counteracted the imbalance in training examples, to some extent at least.
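For reference, per-class and overall accuracy of the kind reported above can be computed as in this sketch. The function name `per_class_accuracy` and the toy predictions are hypothetical.

```python
import numpy as np


def per_class_accuracy(predictions, labels, num_classes=10):
    """Return (per-class accuracy array, overall accuracy) as fractions."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    correct = predictions == labels
    # Fraction of correct predictions among the samples of each true class.
    per_class = np.array([correct[labels == c].mean() for c in range(num_classes)])
    return per_class, correct.mean()


# Toy example with three classes.
toy_labels = np.array([0, 0, 1, 1, 1, 2])
toy_predictions = np.array([0, 1, 1, 1, 0, 2])
per_class, overall = per_class_accuracy(toy_predictions, toy_labels, num_classes=3)
```

The overall accuracy is computed over all samples, so it equals the mean of the per-class accuracies only when every class has the same number of test samples.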

## Comparison with non-autoencoder Classifier

To see whether the autoencoder did make an improvement to the classification performance, another classifier with the exact same architecture was trained. The only difference is that the autoencoder weights were not used for the encoder stages; instead these layers were initialized randomly and trained from scratch.

![title](./ctrl_class_losses.png)

The final results for this classifier were:

| Class | Test Accuracy |
| --- | --- |
| airplane | 77.85% |
| automobile | 87.37% |
| bird | 68.55% |
| cat | 70.49% |
| deer | 75.00% |
| dog | 68.41% |
| frog | 89.00% |
| horse | 85.43% |
| ship | 92.91% |
| truck | 89.81% |
| **Overall** | **80.42%** |


As can be seen, this classifier gave better overall results (80.42% against 78.84%), meaning that, in this case at least, using an autoencoder did not provide a performance boost.

## Conclusion

I was unable to achieve an improvement in classification accuracy by using an autoencoder. However, this does not necessarily mean that autoencoders are not useful. In this case, the test data and training data were from the same domain, so the classifier used as the control could optimize the lower layers for this specific task.
If the autoencoder were instead trained on a larger set of training data and the classifier on a narrower domain, it is possible that the autoencoder would help give better performance.

