In [None]:
%%html
<!-- This cell makes the font bigger to make it easy to read. Adjust to taste -->
<style>
.cell, .CodeMirror pre{ 
    font-size: 100%;
    line-height: 100%;
}
</style>

# COSC470 Assignment 2, 2018

## Name: Hannah Clark-Younger
## Due Date: Monday September 24th

For assignment 2 you need to implement machine learning algorithm(s) to label faces according to:
- sex (male/female)
- age (child/teen/adult/senior)
- expression (smiling/serious)

A data set from MIT is made available, along with code to read the images and labels into `numpy` arrays. 
These arrays are divided into training, validation, and testing data sets.

You may use any machine learning algorithms you like to classify the faces.
Techniques you may find useful that we've looked at include:
- Decision trees and random forests
- Boosting (and AdaBoost in particular)
- Support Vector Machines (SVMs)
- Face detection (to focus on the key parts of the image)
- EigenFaces (for dimensionality reduction)
- Neural networks in TensorFlow
- CNNs in TensorFlow

## Submission Requirements

You should submit a version of this Notebook renamed to `YourName.ipynb`, so my submission would be `StevenMills.ipynb`. 
You can assume that the same libraries that are available in the COSC470 Anaconda environment on the lab machines are available.
In particular, you can use numpy, scipy, OpenCV, and TensorFlow.

I should be able to open your Notebook and run it. The Notebook should contain the code to construct and train your classifier(s) from the training data (using the validation data appropriately) and then to compute the labels of the training data through a call to `computeLabels`, which has a stub implementation at the end of this notebook. 

## Marking Scheme

A rough marking scheme is given below. This is intentionally fairly open, so that I can give you marks for doing good stuff without having to predetermine what stuff is good.

- 10 marks for the discussion of choice of algorithms and training strategy
- 10 marks for the explanation and clear implementation
- 5 marks for performance

### Algorithm Choice and Training

I will be looking for a description of the algorithm(s) chosen, why you chose that approach, and how you developed, trained and evaluated your method.
You should think about issues such as how to best make use of the training and validation data and how to select parameters for your chosen method.

You are not restricted to a single classifier or method. If you find it useful to determine age labels first and then use that to help determine expression, then that is fine. If you want to use an SVM for sex classification, but a boosted classifier for age, that's also fine.
However, you should discuss why you chose to use the methods you have chosen.

### Explanation and Clear Implementation

You should implement your chosen algorithm(s) using the training and validation data sets provided. 
Jupyter notebooks let you interleave discussion and code, so you should clearly describe how your implementation works.
You can include mathematics if needed using \\(\LaTeX\\)-style markup as demonstrated in the lecture notebooks.
I'll be looking for clear implementations that illustrate good practice in training and evaluation.

It is expected that you will make use of libraries such as OpenCV and TensorFlow where appropriate, but your explanation should make your understanding of these tools clear. 
For example, if you choose to use a convolutional network, you should explain your architecture, how it relates to the code, and give some justification for the various parameters that you need to select when making a CNN.

### Performance

The last cell of the notebook has a function that takes a face data set and produces labels as a result.
You should modify this so that it uses your machine learning algorithms to generate the labels.
I will then use these labels to compare your results to the ground truth.
I may also shuffle the training, validation, and testing data sets around before running your code.

# The Data Set

The following code reads the data into training, testing, and validation sets.
It assumes that the `.zip` of labelled face data set from the course website has been unzipped into the same directory as the notebook.
There are 1997 training images, and 998 each of testing and validation images.

In [None]:
import numpy as np

# Read in training data and labels

# Some useful parsing functions

# male/female -> 0/1
def parseSexLabel(string):
    if (string.startswith('male')):
        return 0
    if (string.startswith('female')):
        return 1
    print("ERROR parsing sex from " + string)


# child/teen/adult/senior -> 0/1/2/3
def parseAgeLabel(string):
    if (string.startswith('child')):
        return 0
    if (string.startswith('teen')):
        return 1
    if (string.startswith('adult')):
        return 2
    if (string.startswith('senior')):
        return 3
    print("ERROR parsing age from " + string)


# serious/smiling -> 0/1
def parseExpLabel(string):
    if (string.startswith('serious')):
        return 0
    if (string.startswith('smiling') or string.startswith('funny')):
        return 1
    print("ERROR parsing expression from " + string)


# Count number of training instances

numTraining = 0

for line in open("MITFaces/faceDR"):
    if line.find('_missing descriptor') < 0:
        numTraining += 1

dimensions = 128 * 128

trainingFaces = np.zeros([numTraining, dimensions])
trainingSexLabels = np.zeros(numTraining)  # Sex - 0 = male; 1 = female
trainingAgeLabels = np.zeros(numTraining)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
trainingExpLabels = np.zeros(numTraining)  # Expression - 0 = serious; 1 = smiling

index = 0
for line in open("MITFaces/faceDR"):
    if line.find('_missing descriptor') >= 0:
        continue
    # Parse the label data
    parts = line.split()
    trainingSexLabels[index] = parseSexLabel(parts[2])
    trainingAgeLabels[index] = parseAgeLabel(parts[4])
    trainingExpLabels[index] = parseExpLabel(parts[8])
    # Read in the face
    fileName = "MITFaces/rawdata/" + parts[0]
    fileIn = open(fileName, 'rb')
    trainingFaces[index, :] = np.fromfile(fileIn, dtype=np.uint8, count=dimensions) / 255.0
    fileIn.close()
    # And move along
    index += 1

# Count number of validation/testing instances

numValidation = 0
numTesting = 0

# Assume they're all Validation
for line in open("MITFaces/faceDS"):
    if line.find('_missing descriptor') < 0:
        numValidation += 1

# And make half of them testing
numTesting = int(numValidation / 2)
numValidation -= numTesting

validationFaces = np.zeros([numValidation, dimensions])
validationSexLabels = np.zeros(numValidation)  # Sex - 0 = male; 1 = female
validationAgeLabels = np.zeros(numValidation)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
validationExpLabels = np.zeros(numValidation)  # Expression - 0 = serious; 1 = smiling

testingFaces = np.zeros([numTesting, dimensions])
testingSexLabels = np.zeros(numTesting)  # Sex - 0 = male; 1 = female
testingAgeLabels = np.zeros(numTesting)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
testingExpLabels = np.zeros(numTesting)  # Expression - 0 = serious; 1 = smiling

index = 0
for line in open("MITFaces/faceDS"):
    if line.find('_missing descriptor') >= 0:
        continue

    # Parse the label data
    parts = line.split()
    if (index < numTesting):
        testingSexLabels[index] = parseSexLabel(parts[2])
        testingAgeLabels[index] = parseAgeLabel(parts[4])
        testingExpLabels[index] = parseExpLabel(parts[8])
        # Read in the face
        fileName = "MITFaces/rawdata/" + parts[0]
        fileIn = open(fileName, 'rb')
        testingFaces[index, :] = np.fromfile(fileIn, dtype=np.uint8, count=dimensions) / 255.0
        fileIn.close()
    else:
        vIndex = index - numTesting
        validationSexLabels[vIndex] = parseSexLabel(parts[2])
        validationAgeLabels[vIndex] = parseAgeLabel(parts[4])
        validationExpLabels[vIndex] = parseExpLabel(parts[8])
        # Read in the face
        fileName = "MITFaces/rawdata/" + parts[0]
        fileIn = open(fileName, 'rb')
        validationFaces[vIndex, :] = np.fromfile(fileIn, dtype=np.uint8, count=dimensions) / 255.0
        fileIn.close()

    # And move along
    index += 1
print("Data loaded,", str(trainingFaces.shape[0]), "training images")

# DEEP CNN FOR FACE CLASSIFICATION

Over the past half-decade or so, Convolutional Neural Networks (CNNs) have consistently been among the best-performing machine learning methods for image classification tasks [1] [2] [3] [4]. Given that this project is a set of three image classification tasks, I decided that a CNN was a good bet. CNNs are also simple to implement (in the sense that they don't require complicated engineering) and often generalise well to unseen data after training.

The work with CNNs is in picking the hyperparameters: the number of layers, the number of features each layer can detect, the window size of these features, the learning rate and the batch size and so forth. There is no robust theory that instructs us how to do this in a principled way - it is often largely based on trial and error (and experience).

Before we launch into the methods used, some analysis of the classification tasks and what would count as 'good' accuracy is necessary. We can consider two kinds of naive classifiers. The first outputs a class at random, and the second learns to output the most common class in the training data for all images. In the first case, we should expect 50% accuracy on the Sex and Expression tasks, and 25% accuracy on the Age task. To work out how the second case would perform, we need to know something about the actual distribution of classes. The class distributions over the entire set of images, and over the training, validation, and testing sets I used, are given below. It is the testing set that is particularly relevant because that is the set on which the classifiers will be assessed.

<table>
<tr><th>Sex </th><th>Age</th><th>Expression</th></tr>
<tr><td>

Class | male | female | 
--- | --- | --- |
Training | 57.6% | 42.4% |
Validation | 66.7% | 33.3% |
**Testing** | **61.4%** | **38.6%** |
Overall | 60.8% | 39.2% |

</td><td>
    
Class | child | teen | adult | senior |
--- | --- | --- | --- | --- |
Training | 12.2% | 13.1% | 72.0% | 2.7% |
Validation | 6.3% | 8.0% | 84.3% | 1.4% |
**Testing** | **0.9%** | **0.3%** | **88.7%** | **10.1%** |
Overall | 7.9% | 8.6% | 79.3% | 4.2% |

</td><td>    
      
Class | serious | smiling | 
--- | --- | --- |
Training | 45.9% | 54.1% |
Validation | 50.0% | 50.0% |
**Testing** | **60.1%** | **39.9%** |
Overall | 50.5% | 49.5% |

</td></tr> </table>

The second strategy, then, will result in testing accuracy of 61.4% for the Sex task, 88.7% for the Age task, and 39.9% for the Expression task (because there are more images classified as "smiling" in the training set). This means that for the Sex task, we need to improve on 50% to beat the random classifier, and 61.4% to beat the most-common-class classifier. For the Age task, we need to improve on 25% and 88.7% respectively, and for the Expression task we need to improve on 50% and 39.9% respectively.
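
The two naive baselines described above can be sketched in a few lines of `numpy`. This is only an illustration: the function names and the toy label arrays are made up for the example and are not the MIT labels. The toy data is chosen so that the most common training class is the less common testing class, which is the situation in the Expression task.

```python
import numpy as np

def majority_class_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common *training* class."""
    classes, counts = np.unique(train_labels, return_counts=True)
    majority = classes[np.argmax(counts)]
    return majority, float(np.mean(test_labels == majority))

def random_baseline(n_classes):
    """Expected accuracy of guessing uniformly at random."""
    return 1.0 / n_classes

# Hypothetical toy labels: class 1 dominates training, class 0 dominates testing
toy_train = np.array([1, 1, 1, 0, 0])
toy_test = np.array([0, 0, 0, 1, 1])

majority, accuracy = majority_class_baseline(toy_train, toy_test)
print("Majority class:", majority, "testing accuracy:", accuracy)
print("Random guessing over 4 classes:", random_baseline(4))
```

With the real data, passing `trainingExpLabels` and `testingExpLabels` to the same function should reproduce the most-common-class figure quoted above.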


## Convolutional Neural Networks

Artificial neural networks, inspired by the mammalian brain, have been shown to be very effective at various kinds of learning tasks, such as classifying images by their content or navigating a robot around the world. After the initial excitement over them in the 1940s to 1960s (see [5] [6] [7]), there was a period referred to as the "dark ages" which was spurred by the discovery that, in their simple two-layer form, they are limited to learning tasks that are linearly separable [8]. However, since the conceptual introduction of the generalised delta rule (in 1986) and thus the possibility of having a third (or in fact any number of), "hidden", layer(s) [9], the study of neural networks has revived, and indeed exploded.

Neural networks are implemented as directed graphs: nodes, or "neurons," with weighted connections. They learn by updating the values of the weights of each connection, which produces different levels of activation in each neuron when it is triggered by some input. Depending on the type of artificial neural network, there may be more restrictions on the architecture - that is, the number and arrangement of these neurons and connections.

Deep CNNs have been increasingly popular since 1998 [10], particularly on image classification tasks similar to this one [11] [12]. CNNs are feed-forward (input->layers->output with no loops) neural networks with a particular kind of architecture: they have several layers, including convolutional layers and subsampling layers. The convolutional layers respond to small patches of pixels, rather than to individual pixels. This means that they can preserve the information contained in the relative location of the pixels -- for example, a horizontal line consists of similarly coloured pixels lined up next to one another, not randomly scattered around the image. Each convolutional layer looks for a specified number of "features" (perhaps horizontal lines, or patches of blue pixels), and it can detect them wherever they may occur in the image. Subsampling (max-pooling is commonly used) layers often follow convolutional layers, and compress or "pool" the information extracted by the convolutional layers by keeping track of (in the case of max-pooling) only the strongest candidate for each feature in a given region, or (in the case of average-pooling) the average activation for that feature in the given region. In theory, as the information passes through the network, the features that are detected become more complex. In the earliest layers, the features are basic lines and blobs, and the subsequent layers recombine these features to make increasingly complex features.
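
To make the convolution and max-pooling steps concrete, here is a minimal `numpy` sketch. It is not how TensorFlow implements them, and the hand-made edge kernel is an assumption for illustration (in a real CNN the kernel values are learned). It slides one 3x3 filter over a tiny image and then pools the responses in 2x2 regions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Weighted sum of the patch under the kernel
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Keep the strongest response in each non-overlapping 2x2 region."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hand-made horizontal edge detector (illustrative only)
edge_kernel = np.array([[1.0, 1.0, 1.0],
                        [0.0, 0.0, 0.0],
                        [-1.0, -1.0, -1.0]])

image = np.zeros((6, 6))
image[:3, :] = 1.0  # bright top half, dark bottom half

response = conv2d_valid(image, edge_kernel)  # strong where the edge is
pooled = max_pool_2x2(response)              # half the size, edge still visible
print(response)
print(pooled)
```

The filter responds only in the rows that straddle the bright-to-dark boundary, wherever along the row the patch is placed, and pooling keeps that response while halving each spatial dimension.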

CNNs often use Rectified Linear Units (ReLU) as the activation function (this defines how the inputs to a given unit are combined to produce the activation of the unit). ReLU is a simple activation function, which takes the input (the sum of the units from the previous layer that are connected to the unit in question, weighted by the trainable weights associated with the respective connections, with the trainable bias of the unit added), and gives as the activation of the unit either this input, or 0, whichever is higher. This mitigates the vanishing gradient problem because the slope is always 1 whenever the input is positive: it enables the weights in earlier layers to train faster than they would if a different activation function were used (sigmoid, for example). All of this means that CNNs are very effective at *learning* useful features, detecting them, and combining them to form representations of types of objects so that they can identify them in images.
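
A rough numerical illustration of the gradient argument: the sketch below compares the slope of ReLU with the slope of the sigmoid, and multiplies each slope by itself once per layer. This deliberately ignores the weights and uses the same input at every layer, so it is a simplification of backpropagation rather than a faithful model of it. The input value and depth are arbitrary choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Slope is 1 for positive inputs and 0 otherwise
    return (np.asarray(z) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

z = 2.0      # an arbitrary positive pre-activation
depth = 10   # an arbitrary number of stacked layers

relu_chain = float(relu_grad(z)) ** depth
sigmoid_chain = float(sigmoid_grad(z)) ** depth
print("ReLU slope product:   ", relu_chain)
print("Sigmoid slope product:", sigmoid_chain)
```

The ReLU product stays at 1, while the sigmoid product shrinks towards zero as depth grows, which is the effect that slows learning in the early layers.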


## My CNN

I first implemented a basic CNN with three convolutional and three maxpooling layers. I tried some different numbers of features at each layer, but at its best it reached testing accuracy of 76% for the Sex classification task, 86% for the Age classification task, and 74% for the Expression classification task. While this is significantly above baseline for Sex and Expression (though not for Age), I believed it was possible to do better than that. Since four weeks is not enough time for extensive trial and error, and I don't have much experience, I chose to stand on the shoulders of giants and implement an architecture that was designed by others with much more of both.

In recent years, some of the top performing network architectures on similar classification tasks have been AlexNet [1], VGGNet [2], GoogLeNet (Inception) [3], and ResNet [4]. I started with VGGNet (architecture described below), as it is much simpler than its successors, without any of their bells and whistles, such as inception modules and residual modules. As far as vanilla CNNs go, it can be argued that VGGNet is still the best there is. I also implemented a version of ResNet (this can be found at [13]), but I found that it didn't do as well as VGGNet on these classification tasks, while taking significantly longer to train.

I trained all my CNNs using the training set (of size 1997), checking the progress by assessing the performance on the validation set after each epoch (each full run-through of the training data, with updates to the weights occurring after each mini-batch of size 16). The validation set wasn't used to update weights, just to check the network's performance on data it wasn't trained on. When it reached peak performance on validation data, I saved the network's current state and tested it on the testing data, which had previously not been seen by the network at all. Testing was done by running the model and directly getting the accuracy out of it. Alternatively, it can be computed by using the computeLabels method (below) and comparing the predicted labels to the true labels, which is essentially what the model does to compute accuracy anyway. All the results I present are on testing data. Performance on training data easily reached 100% in all cases, and performance on validation data was similar to that on testing data. But it is the testing performance that matters most, because this measures how well the CNN can generalise what it has learned to unfamiliar data. Validation accuracy is what we use to decide when to stop training, so it will be, by definition, at its highest at the point we stop and test.
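
The checkpoint-selection idea in the paragraph above can be sketched without any TensorFlow. The helper below is hypothetical: the `patience` rule (stop once validation accuracy has not improved for a few epochs) and the accuracy history are assumptions made for the example, not a record of the stopping rule or numbers actually used.

```python
def best_epoch(validation_accuracies, patience=3):
    """Return (epoch, accuracy) of the checkpoint to keep.

    Training stops once `patience` epochs pass without a new best
    validation accuracy; later epochs are never seen.
    """
    best_idx, best_acc = -1, float("-inf")
    for epoch, acc in enumerate(validation_accuracies):
        if acc > best_acc:
            # New peak: this is where the model would be saved
            best_idx, best_acc = epoch, acc
        elif epoch - best_idx >= patience:
            break
    return best_idx, best_acc

# Made-up validation accuracies, one per epoch
history = [0.61, 0.70, 0.78, 0.83, 0.82, 0.81, 0.80, 0.90]
epoch, acc = best_epoch(history)
print("Keep checkpoint from epoch", epoch, "with validation accuracy", acc)
```

Because the saved checkpoint is picked by its validation accuracy, that number is an optimistic estimate, which is why the separate testing set is needed for the final figure.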


## Data augmentation

First, I trained my VGGNets (more details below) on the training data as it came. I achieved testing accuracy of 83% (Sex), 91% (Age), and 85% (Exp) (these results given more precisely below). This was notably better than my first attempt at a basic CNN. However, in an attempt to boost performance even more, as the training data set was only 1997 images (very small for CNNs) I decided to try augmenting the training data. In order to increase the size of the training set, as well as the variation among the examples in this set, I augmented each image in three different ways (separately, producing ten versions of each image including the original). Each image was, four separate times, rotated a random amount between -25 and 25 degrees. Each also had noise added, four separate times. Each was also flipped horizontally (this can only be done once). Including the original image as well, that multiplied the training data by 10, giving 19970 training images. I used methods described in [14] to help me do this.

Running the next cell performs this data augmentation. You may need to install scikit-image to the working environment for this to work. If this is not possible, this step can be skipped and everything else will still run, and the accuracy will be similar to that given above, and below in Results (pre-augmenting).

In [None]:
import random
import skimage as sk ### possibly needs to be installed, not in the existing cosc470 environment
from skimage import transform
from skimage import util
import copy

def random_rotation(image_array):
    random_degree = random.uniform(-25, 25)
    return sk.transform.rotate(image_array, random_degree)

def random_noise(image_array):
    return sk.util.random_noise(image_array)

def horizontal_flip(image_array):
    return image_array[:, ::-1]

new_faces = np.zeros([trainingFaces.shape[0]*10, dimensions])
new_trainingSexLabels = np.zeros(trainingFaces.shape[0]*10)  # Sex - 0 = male; 1 = female
new_trainingAgeLabels = np.zeros(trainingFaces.shape[0]*10)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
new_trainingExpLabels = np.zeros(trainingFaces.shape[0]*10)  # Expression - 0 = serious; 1 = smiling

for i in range(trainingFaces.shape[0]):
    new_index = i*10
    new_faces[new_index, :] = copy.deepcopy(trainingFaces[i]) # the original image
    reshaped = np.reshape(trainingFaces[i], [128,128])
    new_faces[(new_index + 1),:] = np.reshape(random_rotation(reshaped),[128*128]) # the image randomly
    new_faces[(new_index + 2),:] = np.reshape(random_rotation(reshaped),[128*128]) # rotated, four times
    new_faces[(new_index + 3),:] = np.reshape(random_rotation(reshaped),[128*128])
    new_faces[(new_index + 4),:] = np.reshape(random_rotation(reshaped),[128*128])
    new_faces[(new_index + 5),:] = random_noise(trainingFaces[i,:]) # the image with
    new_faces[(new_index + 6),:] = random_noise(trainingFaces[i,:]) # random noise
    new_faces[(new_index + 7),:] = random_noise(trainingFaces[i,:]) # added, four times
    new_faces[(new_index + 8),:] = random_noise(trainingFaces[i,:])
    new_faces[(new_index + 9),:] = np.reshape(horizontal_flip(reshaped),[128*128]) # horizontally flipped
    for j in range(new_index, new_index+10):
        new_trainingSexLabels[j] = trainingSexLabels[i]
        new_trainingAgeLabels[j] = trainingAgeLabels[i]
        new_trainingExpLabels[j] = trainingExpLabels[i]

trainingFaces = copy.deepcopy(new_faces)
trainingSexLabels = copy.deepcopy(new_trainingSexLabels)
trainingAgeLabels = copy.deepcopy(new_trainingAgeLabels)
trainingExpLabels = copy.deepcopy(new_trainingExpLabels)
print("Training data augmented, now", str(trainingFaces.shape[0]), "training images")

## VGGNet Architecture

I implemented the architecture outlined in [2]. They use a stride of 1 for all convolutions so as to maintain the overall size from one layer to the next, and small conv filters (most are a 3x3 window, some only 1x1). Max-pooling is applied after each small cluster of (2 to 4) convolutional layers, and in my implementation ReLU is applied on all convolutional and max-pooling layers. Each cluster of layers has the same field of view and the same number of features. When max-pooling is applied, a stride of 2 is used to downsize by half, and the number of features is doubled in the following cluster until it reaches 512, after which it stays at 512. The network ends with three fully connected layers, the first two with 4096 features, and the final one with an output size the same as the number of classes (so, 2 for the Sex and Expression tasks, 4 for the Age task). Softmax is applied right at the end to give the final output: a value between 0 and 1 for each class (and they sum to 1), interpretable as the predicted likelihood that the image belongs to that class. The highest of these is taken to be the predicted class for that image.
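
As a quick sanity check on the description above, the sketch below traces how five stride-2 max-pools shrink a 128x128 input, and shows softmax followed by argmax on some made-up output values. The helper names and the example logits are assumptions for illustration. The halving rule assumes 'SAME' padding, as used in the code in this notebook.

```python
import numpy as np

def vgg_feature_map_sizes(img_size=128, n_pools=5):
    """Spatial size after each 2x2, stride-2 max-pool with 'SAME' padding."""
    sizes = [img_size]
    for _ in range(n_pools):
        sizes.append((sizes[-1] + 1) // 2)  # ceil(n / 2)
    return sizes

def softmax(logits):
    """Turn raw scores into values in (0, 1) that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

sizes = vgg_feature_map_sizes()
flat_features = sizes[-1] * sizes[-1] * 512  # inputs to the first fully connected layer
print("Feature map sizes:", sizes)
print("Flattened features:", flat_features)

# Made-up raw outputs for the four Age classes
probs = softmax(np.array([2.0, 0.5, -1.0, 0.0]))
predicted = int(np.argmax(probs))
print("Class probabilities:", probs, "predicted class:", predicted)
```

So the last convolutional cluster works on 8x8 feature maps, and the first fully connected layer receives 4 x 4 x 512 values per image.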

I implemented four different versions of this architecture, three taken directly from [2], and named VGG-C, VGG-D (also known as VGG-16) and VGG-E (also known as VGG-19) following the naming conventions in the original article. The fourth option is VGG-E1, which is VGG-E but with 1x1 convolutional layers, following the pattern of VGG-C. These layers provide a way to increase the non-linearity of the decision function without affecting the receptive fields of the convolutional layers. It is essentially an excuse to use an additional ReLU. This strategy was originally proposed by [14]. So, VGG-E1 is to VGG-E (both 19 layers) as VGG-C is to VGG-D (both 16 layers). All of these options can be chosen by specifying the "network" parameter at the top of the next cell. 

The details of the respective architectures are as follows (convolutional layers are denoted as "conv-{window size}-{number of features}"):

| VGG-C        | VGG-D       | VGG-E1  | VGG-E | Name in code |
| ------------- |--------| -----|----- | ----- |
| conv-3-64    | conv-3-64  | conv-3-64  |conv-3-64 | conv1 |
| conv-3-64    | conv-3-64   | conv-3-64  |conv-3-64 | conv2 |
| max-pool       | max-pool    |   max-pool |max-pool| max1 |
| conv-3-128    | conv-3-128  | conv-3-128  |conv-3-128 | conv3 |
| conv-3-128    | conv-3-128   | conv-3-128  |conv-3-128 | conv4 |
| max-pool        | max-pool    |   max-pool |max-pool| max2 |
| conv-3-256    | conv-3-256  | conv-3-256  |conv-3-256 | conv5 |
| conv-3-256    | conv-3-256   | conv-3-256  |conv-3-256 | conv6 |
| conv-1-256    | conv-3-256   | conv-3-256  |conv-3-256 | conv7 |
|     -          |      -        | conv-1-256  |conv-3-256 | conv75 |
| max-pool        | max-pool    |   max-pool |max-pool| max3 |
| conv-3-512    | conv-3-512  | conv-3-512  |conv-3-512 | conv8 |
| conv-3-512    | conv-3-512   | conv-3-512  |conv-3-512 | conv9 |
| conv-1-512    | conv-3-512   | conv-3-512  |conv-3-512 | conv10 |
|     -          |      -        | conv-1-512  |conv-3-512 | conv105 |
| max-pool        | max-pool    |   max-pool |max-pool| max4 |
| conv-3-512    | conv-3-512  | conv-3-512  |conv-3-512 | conv11 |
| conv-3-512    | conv-3-512   | conv-3-512  |conv-3-512 | conv12 |
| conv-1-512    | conv-3-512   | conv-3-512  |conv-3-512 | conv13 |
|     -          |      -        | conv-1-512  |conv-3-512 | conv135 |
| max-pool        | max-pool    |   max-pool |max-pool| max5 |
| fc-4096        | fc-4096     |   fc-4096  |fc-4096 | fc1 |
| fc-4096        | fc-4096     |   fc-4096  |fc-4096 | fc2 |
| fc-n_classes        | fc-n_classes      |  fc-n_classes   |fc-n_classes  | fc3 |

I also converted the labels to one-hot, which makes it convenient to train as the network gives a softmax prediction (0-1) of how likely it is to be in that class. The predicted class is taken to be the maximum of these activations.
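
For reference, one-hot encoding can also be written as a single indexing operation. This is an alternative sketch, not the `make_one_hot` function used in the next cell, and the `one_hot` name and example labels are made up here. It also shows that taking the argmax of each row recovers the original class index.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Row i of the identity matrix is the one-hot vector for class i."""
    return np.eye(n_classes)[np.asarray(labels, dtype=int)]

# Example Age labels stored as floats, as in the label arrays above
age_labels = np.array([0.0, 2.0, 3.0, 1.0])
encoded = one_hot(age_labels, 4)
decoded = np.argmax(encoded, axis=1)

print(encoded)
print(decoded)
```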

The next cell loads the model into the graph. You can choose both the task and which network to train by modifying the *task* and *network* variables. I found that VGG-D and VGG-E tended to be most reliably accurate (results below).

In [None]:
import sys  # needed for sys.stdout.flush() below

import tensorflow as tf

####### MODIFY THESE PARAMETERS ######

task = "Sex"      # Options are "Sex", "Age", "Expression"
network = "vggD"  # Options are "vggC", "vggD", "vggE", "vggE1"
device = "/cpu:0" # "/cpu:0" or "/gpu:0"
display_step = 1  # How often it prints out progress (1 = every epoch)
saver_step = 10   # How often it saves the model (it will also save any with validation accuracy of higher than 84%)
learning_rate = 0.0001 # I used learning rate = 0.0001

######################################

n_filters_conv1 = 64
filter_size_conv1 = 3
stride1 = 1

n_filters_conv2 = 64
filter_size_conv2 = 3
stride2 = 1

n_filters_conv3 = 128
filter_size_conv3 = 3
stride3 = 1

n_filters_conv4 = 128
filter_size_conv4 = 3
stride4 = 1

n_filters_conv5 = 256
filter_size_conv5 = 3
stride5 = 1

n_filters_conv6 = 256
filter_size_conv6 = 3
stride6 = 1

n_filters_conv7 = 256
filter_size_conv7 = 3
stride7 = 1
if network == "vggC":
    filter_size_conv7 = 1

n_filters_conv75 = 256 ## used for vggE and vggE1 only
filter_size_conv75 = 3
stride75 = 1
if network == "vggE1":
    filter_size_conv75 = 1

n_filters_conv8 = 512
filter_size_conv8 = 3
stride8 = 1

n_filters_conv9 = 512
filter_size_conv9 = 3
stride9 = 1

n_filters_conv10 = 512
filter_size_conv10 = 3
stride10 = 1
if network == "vggC":
    filter_size_conv10 = 1

n_filters_conv105 = 512 ## used for vggE and vggE1 only
filter_size_conv105 = 3
stride105 = 1
if network == "vggE1":
    filter_size_conv105 = 1

n_filters_conv11 = 512
filter_size_conv11 = 3
stride11 = 1

n_filters_conv12 = 512
filter_size_conv12 = 3
stride12 = 1

n_filters_conv13 = 512
filter_size_conv13 = 3
stride13 = 1
if network == "vggC":
    filter_size_conv13 = 1

n_filters_conv135 = 512 ## used for vggE and vggE1 only
filter_size_conv135 = 3
stride135 = 1
if network == "vggE1":
    filter_size_conv135 = 1

fc1_layer_size = 4096
fc2_layer_size = 4096

def make_one_hot(labels):
    global n_classes
    one_label = np.zeros(n_classes)
    new_labels = [one_label] * len(labels)
    for i in range(len(labels)):
        one_label = np.zeros(n_classes)
        one_label[int(labels[i])] = 1
        new_labels[i] = one_label
    return np.array(new_labels)


class Dataset:
    def __init__(self, data, labels):
        self.data = data.reshape([-1, 128, 128, 1])  
        self.labels = labels
        self.batch_index = 0

    def randomize(self, sess):
        shuffled_data = np.empty(self.data.shape, dtype=self.data.dtype)
        shuffled_labels = np.empty(self.labels.shape, dtype=self.labels.dtype)
        permutation = np.random.permutation(len(self.data))
        for old_index, new_index in enumerate(permutation):
            shuffled_data[new_index] = self.data[old_index]
            shuffled_labels[new_index] = self.labels[old_index]
        self.data = shuffled_data
        self.labels = shuffled_labels

    def next_batch(self, b_size):
        start = self.batch_index
        end = self.batch_index + b_size
        self.batch_index = end
        return self.data[start:end], self.labels[start:end]


def conv_relu_layer(input, n_input, n_filters, filter_size, stride):
    weights = tf.Variable(tf.truncated_normal(shape=[filter_size, filter_size, n_input, n_filters], stddev=0.05))
    biases = tf.Variable(tf.constant(0.05, shape=[n_filters]))
    conv_layer = tf.nn.conv2d(input=input, filter=weights, strides=[1, stride, stride, 1], padding='SAME')
    conv_layer += biases
    c_r_layer = tf.nn.relu(conv_layer)
    return c_r_layer

def maxpool_relu_layer(input):
    m_layer = tf.nn.max_pool(value=input, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    m_r_layer = tf.nn.relu(m_layer)
    return m_r_layer

def flat_layer(input_layer):
    shape = input_layer.get_shape()
    num_features = shape[1:4].num_elements()
    flat_layer = tf.reshape(input_layer, [-1, num_features])
    return flat_layer

def fc_layer(input, n_inputs, n_outputs, use_relu=True):
    weights = tf.Variable(tf.truncated_normal(shape=[n_inputs, n_outputs], stddev=0.05))
    biases = tf.Variable(tf.constant(0.05, shape=[n_outputs]))
    fc_layer = tf.matmul(input, weights) + biases
    if use_relu:
        fc_layer = tf.nn.relu(fc_layer)
    return fc_layer


if task == "Sex":
    n_classes = 2
    train_labels = make_one_hot(trainingSexLabels)
    valid_labels = make_one_hot(validationSexLabels)
    test_labels = make_one_hot(testingSexLabels)
    train_data = Dataset(trainingFaces, train_labels)
    valid_data = Dataset(validationFaces, valid_labels)
    test_data = Dataset(testingFaces, test_labels)
    model = "sex-model"
elif task == "Age":
    n_classes = 4
    train_labels = make_one_hot(trainingAgeLabels)
    valid_labels = make_one_hot(validationAgeLabels)
    test_labels = make_one_hot(testingAgeLabels)
    train_data = Dataset(trainingFaces, train_labels)
    valid_data = Dataset(validationFaces, valid_labels)
    test_data = Dataset(testingFaces, test_labels)
    model = "age-model"
elif task == "Expression":
    n_classes = 2
    train_labels = make_one_hot(trainingExpLabels)
    valid_labels = make_one_hot(validationExpLabels)
    test_labels = make_one_hot(testingExpLabels)
    train_data = Dataset(trainingFaces, train_labels)
    valid_data = Dataset(validationFaces, valid_labels)
    test_data = Dataset(testingFaces, test_labels)
    model = "exp-model"
else:
    print("Please set task to one of the three options")
    sys.stdout.flush()

img_size = 128
num_channels = 1  # greyscale

with tf.device(device):
    # set up VGG
    g = tf.Graph()
    with g.as_default():
        X = tf.placeholder(tf.float32, shape=[None, img_size, img_size, num_channels], name='X')
        y_true = tf.placeholder(tf.float32, shape=[None, n_classes], name='y_true')
        y_true_class = tf.argmax(y_true, axis=1)  # `dimension` is a deprecated alias for `axis`

        conv1 = conv_relu_layer(input=X, n_input=num_channels, n_filters=n_filters_conv1,
                                filter_size=filter_size_conv1, stride = stride1)
        conv2 = conv_relu_layer(input=conv1, n_input=n_filters_conv1, n_filters=n_filters_conv2,
                                filter_size=filter_size_conv2, stride = stride2)
        max1 = maxpool_relu_layer(conv2)
        conv3 = conv_relu_layer(input=max1, n_input=n_filters_conv2, n_filters=n_filters_conv3,
                                filter_size=filter_size_conv3, stride=stride3)
        conv4 = conv_relu_layer(input=conv3, n_input=n_filters_conv3, n_filters=n_filters_conv4,
                                filter_size=filter_size_conv4, stride=stride4)
        max2 = maxpool_relu_layer(conv4)
        conv5 = conv_relu_layer(input=max2, n_input=n_filters_conv4, n_filters=n_filters_conv5,
                                filter_size=filter_size_conv5, stride = stride5)
        conv6 = conv_relu_layer(input=conv5, n_input=n_filters_conv5, n_filters=n_filters_conv6,
                                filter_size=filter_size_conv6, stride = stride6)
        conv7 = conv_relu_layer(input=conv6, n_input=n_filters_conv6, n_filters=n_filters_conv7,
                                filter_size=filter_size_conv7, stride = stride7)
        if network == "vggE" or network == "vggE1":
            conv75 = conv_relu_layer(input=conv7, n_input=n_filters_conv7, n_filters=n_filters_conv75,
                                     filter_size=filter_size_conv75, stride=stride75)
            max3 = maxpool_relu_layer(conv75)
        else:
            max3 = maxpool_relu_layer(conv7)
        conv8 = conv_relu_layer(input=max3, n_input=n_filters_conv7, n_filters=n_filters_conv8,
                                filter_size=filter_size_conv8, stride=stride8)
        conv9 = conv_relu_layer(input=conv8, n_input=n_filters_conv8, n_filters=n_filters_conv9,
                                filter_size=filter_size_conv9, stride=stride9)
        conv10 = conv_relu_layer(input=conv9, n_input=n_filters_conv9, n_filters=n_filters_conv10,
                                filter_size=filter_size_conv10, stride=stride10)
        if network == "vggE" or network == "vggE1":
            conv105 = conv_relu_layer(input=conv10, n_input=n_filters_conv10, n_filters=n_filters_conv105,
                                      filter_size=filter_size_conv105, stride=stride105)
            max4 = maxpool_relu_layer(conv105)
        else:
            max4 = maxpool_relu_layer(conv10)
        conv11 = conv_relu_layer(input=max4, n_input=n_filters_conv10, n_filters=n_filters_conv11,
                                 filter_size=filter_size_conv11, stride=stride11)
        conv12 = conv_relu_layer(input=conv11, n_input=n_filters_conv11, n_filters=n_filters_conv12,
                                 filter_size=filter_size_conv12, stride=stride12)
        conv13 = conv_relu_layer(input=conv12, n_input=n_filters_conv12, n_filters=n_filters_conv13,
                                 filter_size=filter_size_conv13, stride=stride13)
        if network == "vggE" or network == "vggE1":
            conv135 = conv_relu_layer(input=conv13, n_input=n_filters_conv13, n_filters=n_filters_conv135,
                                      filter_size=filter_size_conv135, stride=stride135)
            max5 = maxpool_relu_layer(conv135)
        else:
            max5 = maxpool_relu_layer(conv13)
        flat = flat_layer(max5)
        fc1 = fc_layer(input=flat, n_inputs=flat.get_shape()[1:4].num_elements(), n_outputs=fc1_layer_size)
        fc2 = fc_layer(input=fc1, n_inputs=fc1_layer_size, n_outputs=fc2_layer_size)
        fc3 = fc_layer(input=fc2, n_inputs=fc2_layer_size, n_outputs=n_classes, use_relu=False)  
        y_pred = tf.nn.softmax(fc3, name="y_pred")
        y_pred_class = tf.argmax(y_pred, axis=1)  # `dimension` is a deprecated alias for `axis`
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=fc3, labels=y_true) 
        cost = tf.reduce_mean(cross_entropy)
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
        correct_prediction = tf.equal(y_pred_class, y_true_class)
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy")

print("Graph initialised")

## Training Details

I shuffle the training data at the start of each epoch, so that an image is not always seen in the same batch or in the same order. An epoch is one pass over the training images (strictly, two images are skipped every epoch, because the 19970 training images do not divide evenly into batches of 16). Cost is the mean cross-entropy between the output of the final layer of the network and the true labels. I used the Adam optimizer [16], which adapts the effective step size for each parameter as training proceeds, so it can learn quickly at first and then slow down as it converges on a minimum. I used a starting learning rate of 0.0001; of the faster and slower rates I tried, this gave the best and most reliable results. Classification accuracy is the percentage of correctly classified images.
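The shuffling and batch arithmetic described above can be illustrated with a small NumPy sketch. This is not the `Dataset` class used in this notebook; `epoch_batches` is a hypothetical helper that only shows why two of the 19970 images go unseen in each epoch when the incomplete final batch is dropped.

```python
import numpy as np

def epoch_batches(n_images, batch_size, rng):
    """Yield index arrays for one epoch: shuffle once, then emit full batches only."""
    order = rng.permutation(n_images)      # a fresh random order for this epoch
    n_batches = n_images // batch_size     # the incomplete final batch is dropped
    for b in range(n_batches):
        yield order[b * batch_size:(b + 1) * batch_size]

rng = np.random.RandomState(0)
batches = list(epoch_batches(19970, 16, rng))
n_seen = sum(len(b) for b in batches)
print("batches per epoch:", len(batches))
print("images skipped this epoch:", 19970 - n_seen)
```

Because the order is reshuffled every epoch, a different pair of images is left out each time, so over many epochs every image is still used for training.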

The next cell trains the network -- this takes a long time if not using a GPU, especially if using augmented (and thus 10 times more) training data. I've included the pre-trained models from which I derived my final (best) results.

In [None]:
import os
import sys

####### MODIFY THESE PARAMETERS ###### (I used batch_size = 16)

n_epochs = 200
batch_size = 16

######################################

n_batches = trainingFaces.shape[0] // batch_size
val_batches = validationFaces.shape[0] // batch_size

with tf.device(device):
    with g.as_default():    
        saver = tf.train.Saver()
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            val_acc = 0
            val_loss = 0
            for val in range(val_batches):
                x_valid_batch, y_valid_batch = valid_data.next_batch(batch_size)
                feed_dict_val = {X: x_valid_batch, y_true: y_valid_batch}
                val_acc += sess.run(accuracy, feed_dict=feed_dict_val)
                val_loss += sess.run(cost, feed_dict=feed_dict_val)
            val_acc = val_acc / val_batches
            val_loss = val_loss / val_batches
            msg = "Pre-training (Epoch {0}) --- Training Accuracy: {1:>6.2%}, Validation Accuracy: {2:>6.2%},  Validation Loss: {3:.3f}"
            print(msg.format(0, 0, val_acc, val_loss))
            sys.stdout.flush()
            for i in range(1, n_epochs + 1):
                train_data.randomize(sess)
                train_data.batch_index = 0
                valid_data.randomize(sess)
                valid_data.batch_index = 0
                acc = 0
                val_acc = 0
                val_loss = 0
                for batch in range(n_batches):
                    x_batch, y_true_batch = train_data.next_batch(batch_size)
                    feed_dict_train = {X: x_batch, y_true: y_true_batch}
                    sess.run(optimizer, feed_dict=feed_dict_train)
                    acc += sess.run(accuracy, feed_dict=feed_dict_train)
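                if i % display_step == 0:
                    # Validation metrics are only computed every display_step epochs;
                    # on all other epochs they are reported (and tested below) as 0.
                    valid_data.batch_index = 0
                    for j in range(val_batches):
                        x_valid_batch, y_valid_batch = valid_data.next_batch(batch_size)
                        feed_dict_val = {X: x_valid_batch, y_true: y_valid_batch}
                        # Fetch both metrics in one forward pass instead of two.
                        batch_acc, batch_loss = sess.run([accuracy, cost], feed_dict=feed_dict_val)
                        val_acc += batch_acc
                        val_loss += batch_loss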
                acc = acc / n_batches
                val_acc = val_acc / val_batches
                val_loss = val_loss / val_batches
                msg = "Training Epoch {0} --- Training Accuracy: {1:>6.2%}, Validation Accuracy: {2:>6.2%},  Validation Loss: {3:.3f}"
                print(msg.format(i, acc, val_acc, val_loss))
                sys.stdout.flush()
                if i % saver_step == 0 or val_acc > 0.84:
                    save_path = saver.save(sess, model+"_"+network+"_"+str(i))
                    
print("Training done")

## Results on this dataset without augmentation

The classification accuracy on the testing data for the three tasks and the four network architectures, trained on the original (non-augmented) training data, is given below, along with the number of epochs needed to reach that accuracy (after that, the networks tend to overtrain, and accuracy drops a little). Recall the baselines to beat: for the Sex task, 50% for the random classifier and 61.4% for the most-common-class classifier; for the Age task, 25% and 88.7% respectively; and for the Expression task, 50% and 39.9%.

<table>
<tr><th>Sex </th><th>Age</th><th>Expression</th></tr>
<tr><td>

|   Network   | VGG-C | VGG-D | VGG-E1 | VGG-E |
| ------ |--------| -----|----- | ----- |
| Accuracy | 83.27% | 83.37% | 82.06% | 83.47% |
| Epochs | 70 | 90 | 120 | 120 |

</td><td>

|    Network  | VGG-C | VGG-D | VGG-E1 | VGG-E |
| ------ |--------| -----|----- | ----- |
| Accuracy | 90.22% | 91.73% | 90.42% | 91.53% |
| Epochs | 70 | 100 | 190 | 80 |

</td><td>

| Network| VGG-C | VGG-D | VGG-E1 | VGG-E |
| ------ |--------| -----|----- | ----- |
| Accuracy | 83.47% | 85.79% | 84.48% | 84.48% |
| Epochs | 40 | 100 | 90 | 90 |

</td></tr> </table>

All of the networks improve on both naive classifiers, though only minimally on the Age task. VGG-D and VGG-E are the best performers, which is unsurprising given that they have become the well-known VGG-16 and VGG-19 respectively (VGG-D has 16 layers of trainable weights, and VGG-E has 19). VGG-C tends to peak more quickly, but never reaches the same levels.
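For reference, the two naive baselines can be computed from any integer label array as in the sketch below. The helper name `naive_baselines` and the toy labels are made up for illustration; they are not the assignment's data.

```python
import numpy as np

def naive_baselines(labels, n_classes):
    """Return (expected accuracy of uniform random guessing,
               accuracy of always predicting the most common class)."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=n_classes)
    random_acc = 1.0 / n_classes                 # uniform guess over n_classes
    majority_acc = counts.max() / labels.size    # share of the largest class
    return random_acc, majority_acc

# Toy 4-class example (illustrative only): class 2 dominates.
toy_labels = np.array([2, 2, 2, 2, 2, 2, 2, 0, 1, 3])
random_acc, majority_acc = naive_baselines(toy_labels, n_classes=4)
print("random: {:.1%}, most common class: {:.1%}".format(random_acc, majority_acc))
```

The more skewed the class distribution, the higher the most-common-class baseline, which is why the Age task leaves so little headroom above it.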

## Results on this dataset with augmentation

Because I judged VGG-D and VGG-E to be the best performers overall (the most accurate), I restricted my second investigation to these two architectures. The classification accuracy on the testing data for these two networks, trained with augmented data, is given below.

<table>
<tr><th>Sex </th><th>Age</th><th>Expression</th></tr>
<tr><td>

| Network     | VGG-D |  VGG-E |
| ------ |-----| ----- |
| Accuracy | 88.91% | 82.06% |
| Epochs |  70 | 30 |

</td><td>

| Network     |  VGG-D| VGG-E |
| ------ |--------| ----- |
| Accuracy |  90.22% | 89.92% |
| Epochs |  20 | 70 |

</td><td>

| Network|  VGG-D | VGG-E |
| ------ |--------| -----|
| Accuracy |  85.89% | 84.07% |
| Epochs |  40 | 20 |

</td></tr> </table>

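Comparing the best result for each task with and without augmentation, the Sex task is the only one where accuracy improved significantly (with VGG-D). Accuracy on the Expression task improved by a negligible amount, and accuracy on the Age task declined a little. VGG-E was slightly less accurate on all three tasks when trained with augmented data.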
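The size of these changes can be read off the two sets of tables. The short sketch below just subtracts the reported accuracies (in percentage points); the numbers are copied from the tables above, and the dictionary layout is my own choice for illustration.

```python
# Test accuracies (%) copied from the tables above.
no_aug = {
    "Sex":        {"VGG-D": 83.37, "VGG-E": 83.47},
    "Age":        {"VGG-D": 91.73, "VGG-E": 91.53},
    "Expression": {"VGG-D": 85.79, "VGG-E": 84.48},
}
aug = {
    "Sex":        {"VGG-D": 88.91, "VGG-E": 82.06},
    "Age":        {"VGG-D": 90.22, "VGG-E": 89.92},
    "Expression": {"VGG-D": 85.89, "VGG-E": 84.07},
}

# Change in accuracy (percentage points) when training on augmented data.
delta = {
    task: {net: round(aug[task][net] - no_aug[task][net], 2) for net in nets}
    for task, nets in no_aug.items()
}

for task, nets in delta.items():
    for net, change in nets.items():
        print("{:<10} {}: {:+.2f}".format(task, net, change))
```

Since each figure comes from a single training run, small differences of this kind should not be over-interpreted.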

So, the best version of each of my three classifiers (these models are included in my submission) gives the following results:

| Task| Sex | Age | Expression |
| ------ |--------| -----| -----|
| Accuracy | 88.91%  | 91.73% | 85.89% |

The next cell loads the trained models and computes the labels for the testing data on each of the three classification tasks. Predictions are made in batches of 10, because 1000 images don't all fit into memory with a network this size, so a vector (numpy array) of all predictions cannot be obtained at once. As a consequence, if the number of images passed in is not divisible by 10, the last few images won't obtain a prediction.

In [None]:
#### These are the models I trained (I picked the best I got for each task), included in the zip file. 
#### You can change the names and types to check those you've trained yourself.

sexModel = "best_sex-model_vggD"
sexType = "vggD"
ageModel = "best_age-model_vggD"
ageType = "vggD"
expModel = "best_exp-model_vggD"
expType = "vggD"

def predictLabels(data, n_data, model, taskIn, networkIn):
    batch_size = 10 # This assumes the number of images passed in is divisible by 10; any remainder gets no label
    task = taskIn  # Options are "Sex", "Age", "Expression"
    network = networkIn # Options are "vggC", "vggD", "vggE", "vggE1"
    device = "/cpu:0" # "/cpu:0" or "/gpu:0"

    n_filters_conv1 = 64
    filter_size_conv1 = 3
    stride1 = 1

    n_filters_conv2 = 64
    filter_size_conv2 = 3
    stride2 = 1

    n_filters_conv3 = 128
    filter_size_conv3 = 3
    stride3 = 1

    n_filters_conv4 = 128
    filter_size_conv4 = 3
    stride4 = 1

    n_filters_conv5 = 256
    filter_size_conv5 = 3
    stride5 = 1

    n_filters_conv6 = 256
    filter_size_conv6 = 3
    stride6 = 1

    n_filters_conv7 = 256
    filter_size_conv7 = 3
    stride7 = 1
    if network == "vggC":
        filter_size_conv7 = 1

    n_filters_conv75 = 256 ## used for vggE and vggE1 only
    filter_size_conv75 = 3
    stride75 = 1
    if network == "vggE1":
        filter_size_conv75 = 1

    n_filters_conv8 = 512
    filter_size_conv8 = 3
    stride8 = 1

    n_filters_conv9 = 512
    filter_size_conv9 = 3
    stride9 = 1

    n_filters_conv10 = 512
    filter_size_conv10 = 3
    stride10 = 1
    if network == "vggC":
        filter_size_conv10 = 1

    n_filters_conv105 = 512 ## used for vggE and vggE1 only
    filter_size_conv105 = 3
    stride105 = 1
    if network == "vggE1":
        filter_size_conv105 = 1

    n_filters_conv11 = 512
    filter_size_conv11 = 3
    stride11 = 1

    n_filters_conv12 = 512
    filter_size_conv12 = 3
    stride12 = 1

    n_filters_conv13 = 512
    filter_size_conv13 = 3
    stride13 = 1
    if network == "vggC":
        filter_size_conv13 = 1

    n_filters_conv135 = 512 ## used for vggE and vggE1 only
    filter_size_conv135 = 3
    stride135 = 1
    if network == "vggE1":
        filter_size_conv135 = 1

    fc1_layer_size = 4096
    fc2_layer_size = 4096

    if task == "Sex" or task == "Expression":
        n_classes = 2
    elif task == "Age":
        n_classes = 4
    else:
        print("Please set task to one of the three options")
        sys.stdout.flush()

    img_size = 128
    num_channels = 1  # greyscale
    
    with tf.device(device):
        # set up VGG
        g = tf.Graph()
        with g.as_default():
            X = tf.placeholder(tf.float32, shape=[None, img_size, img_size, num_channels], name='X')
            y_true = tf.placeholder(tf.float32, shape=[None, n_classes], name='y_true')
            y_true_class = tf.argmax(y_true, axis=1)  # `dimension` is a deprecated alias for `axis`

            conv1 = conv_relu_layer(input=X, n_input=num_channels, n_filters=n_filters_conv1,
                                    filter_size=filter_size_conv1, stride = stride1)
            conv2 = conv_relu_layer(input=conv1, n_input=n_filters_conv1, n_filters=n_filters_conv2,
                                    filter_size=filter_size_conv2, stride = stride2)
            max1 = maxpool_relu_layer(conv2)
            conv3 = conv_relu_layer(input=max1, n_input=n_filters_conv2, n_filters=n_filters_conv3,
                                    filter_size=filter_size_conv3, stride=stride3)
            conv4 = conv_relu_layer(input=conv3, n_input=n_filters_conv3, n_filters=n_filters_conv4,
                                    filter_size=filter_size_conv4, stride=stride4)
            max2 = maxpool_relu_layer(conv4)
            conv5 = conv_relu_layer(input=max2, n_input=n_filters_conv4, n_filters=n_filters_conv5,
                                    filter_size=filter_size_conv5, stride = stride5)
            conv6 = conv_relu_layer(input=conv5, n_input=n_filters_conv5, n_filters=n_filters_conv6,
                                    filter_size=filter_size_conv6, stride = stride6)
            conv7 = conv_relu_layer(input=conv6, n_input=n_filters_conv6, n_filters=n_filters_conv7,
                                    filter_size=filter_size_conv7, stride = stride7)
            if network == "vggE" or network == "vggE1":
                conv75 = conv_relu_layer(input=conv7, n_input=n_filters_conv7, n_filters=n_filters_conv75,
                                         filter_size=filter_size_conv75, stride=stride75)
                max3 = maxpool_relu_layer(conv75)
            else:
                max3 = maxpool_relu_layer(conv7)
            conv8 = conv_relu_layer(input=max3, n_input=n_filters_conv7, n_filters=n_filters_conv8,
                                    filter_size=filter_size_conv8, stride=stride8)
            conv9 = conv_relu_layer(input=conv8, n_input=n_filters_conv8, n_filters=n_filters_conv9,
                                    filter_size=filter_size_conv9, stride=stride9)
            conv10 = conv_relu_layer(input=conv9, n_input=n_filters_conv9, n_filters=n_filters_conv10,
                                    filter_size=filter_size_conv10, stride=stride10)
            if network == "vggE" or network == "vggE1":
                conv105 = conv_relu_layer(input=conv10, n_input=n_filters_conv10, n_filters=n_filters_conv105,
                                          filter_size=filter_size_conv105, stride=stride105)
                max4 = maxpool_relu_layer(conv105)
            else:
                max4 = maxpool_relu_layer(conv10)
            conv11 = conv_relu_layer(input=max4, n_input=n_filters_conv10, n_filters=n_filters_conv11,
                                     filter_size=filter_size_conv11, stride=stride11)
            conv12 = conv_relu_layer(input=conv11, n_input=n_filters_conv11, n_filters=n_filters_conv12,
                                     filter_size=filter_size_conv12, stride=stride12)
            conv13 = conv_relu_layer(input=conv12, n_input=n_filters_conv12, n_filters=n_filters_conv13,
                                     filter_size=filter_size_conv13, stride=stride13)
            if network == "vggE" or network == "vggE1":
                conv135 = conv_relu_layer(input=conv13, n_input=n_filters_conv13, n_filters=n_filters_conv135,
                                          filter_size=filter_size_conv135, stride=stride135)
                max5 = maxpool_relu_layer(conv135)
            else:
                max5 = maxpool_relu_layer(conv13)
            flat = flat_layer(max5)
            fc1 = fc_layer(input=flat, n_inputs=flat.get_shape()[1:4].num_elements(), n_outputs=fc1_layer_size)
            #fc1 = fc_layer(input=max5, n_inputs=filter_size_conv13, n_outputs=fc1_layer_size)
            fc2 = fc_layer(input=fc1, n_inputs=fc1_layer_size, n_outputs=fc2_layer_size)
            fc3 = fc_layer(input=fc2, n_inputs=fc2_layer_size, n_outputs=n_classes, use_relu=False)  # n_outputs=n_classes
            y_pred = tf.nn.softmax(fc3, name="y_pred")
            y_pred_class = tf.argmax(y_pred, axis=1, name="y_pred_class")  # `dimension` is a deprecated alias for `axis`
            cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=fc3, labels=y_true)
            cost = tf.reduce_mean(cross_entropy)
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
            correct_prediction = tf.equal(y_pred_class, y_true_class)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy")
            
            saver = tf.train.Saver()

            with tf.Session() as sess:
                saver.restore(sess, model)
                n_batches = n_data//batch_size
                predicted_labels = np.array([])
                for i in range(n_batches):
                    x_test, y_test = data.next_batch(batch_size)
                    feed_dict_val = {X: x_test, y_true: y_test}
                    predicted_labels = np.append(predicted_labels, sess.run(y_pred_class, feed_dict=feed_dict_val))
            return predicted_labels
    
# This function will be used to evaluate your submission.
def computeLabels(faceData):
    n, d = faceData.shape
    # Zero arrays for the labels, should be able to do better than this
    estSexLabels = np.zeros(n)
    estAgeLabels = np.zeros(n)
    estExpLabels = np.zeros(n)
    
    # turn faceData into a Dataset object
    sex_labels = np.array([[0]*2]*n)
    dataset = Dataset(faceData, sex_labels)
    estSexLabels = predictLabels(data=dataset, n_data=n, model=sexModel, taskIn="Sex", networkIn=sexType)

    age_labels = np.array([[0]*4]*n)
    dataset = Dataset(faceData, age_labels)
    estAgeLabels = predictLabels(data=dataset, n_data=n, model=ageModel, taskIn="Age", networkIn=ageType)

    exp_labels = np.array([[0]*2]*n)
    dataset = Dataset(faceData, exp_labels)
    estExpLabels = predictLabels(data=dataset, n_data=n, model=expModel, taskIn="Expression", networkIn=expType)
    
    return estSexLabels, estAgeLabels, estExpLabels

estS, estA, estE = computeLabels(validationFaces)
# print(estS.shape, estA.shape, estE.shape)
# I'll do stuff with the above to evaluate the accuracy of your methods
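The prediction loop above only processes full batches of 10, so any images beyond the last full batch get no label. A minimal sketch of one way to label every image is given below. It is independent of TensorFlow: `predict_in_batches` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for a call to the trained network. It relies on the batch dimension being variable, which the `X` placeholder's `None` first dimension allows.

```python
import numpy as np

def predict_in_batches(images, predict_fn, batch_size=10):
    """Apply predict_fn to consecutive chunks of images, including a shorter final chunk."""
    outputs = []
    for start in range(0, images.shape[0], batch_size):
        chunk = images[start:start + batch_size]   # the last chunk may hold fewer images
        outputs.append(np.asarray(predict_fn(chunk)))
    if not outputs:
        return np.array([], dtype=np.int64)
    return np.concatenate(outputs)

def dummy_predict(batch):
    """Stand-in classifier (assumption): label 1 if the mean pixel value exceeds 0.5."""
    return (batch.reshape(len(batch), -1).mean(axis=1) > 0.5).astype(np.int64)

# 23 fake greyscale images: not divisible by 10, yet every image gets a label.
images = np.random.RandomState(0).rand(23, 128, 128, 1)
labels = predict_in_batches(images, dummy_predict)
print(labels.shape)
```

With the real network, `predict_fn` would presumably wrap something like `sess.run(y_pred_class, feed_dict={X: chunk})`; I have not tested that against the saved models.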

## Discussion

The performance of the four variations of VGGNet was similar, and without running several trials to measure the average performance and the number of epochs needed to converge, I wouldn't claim any general conclusions about their relative performance. Data augmentation significantly improved performance on the Sex task, but not on the other two tasks. It did, however, get them to converge in far fewer epochs, though that is unsurprising given that each epoch lets the network learn from ten times as many images.

Compared to the naive classifiers, my versions of VGGNet outperformed both significantly on the Sex and Expression tasks. On the Age task, however, my best accuracy was 91.73%, which (while significantly better than the random classifier) is only about 3 percentage points better than the hypothetical classifier that learns the most common class and predicts it for every image. Perhaps this task is more difficult to learn because of the heavy bias toward the 'adult' class. Because predicting that everyone is an adult gives such high accuracy (and, more relevantly for training, such low cost), it is difficult to leap out of this deep local minimum. Perhaps, also, augmenting the data in the uniform way I chose entrenches the bias toward the much more common class (though this wouldn't explain the lack of improvement on the Expression task). If so, it could perhaps be mitigated by increasing the number of training examples of the other three classes but not of the most common class.
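As a rough illustration of that last idea, the sketch below balances a label set by resampling the smaller classes until each matches the largest one. Note that it duplicates existing examples rather than generating new augmented ones, which is a simpler stand-in for what is suggested above; `oversample_minority` and the toy data are hypothetical and I did not run this experiment.

```python
import numpy as np

def oversample_minority(images, labels, rng):
    """Resample with replacement so every class has as many examples as the largest class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    chosen = []
    for c, count in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        # Draw the missing examples at random from this class (none for the largest class).
        extra = idx[rng.randint(0, idx.size, size=target - count)]
        chosen.append(np.concatenate([idx, extra]))
    order = rng.permutation(np.concatenate(chosen))   # mix the classes back together
    return images[order], labels[order]

# Toy data (illustrative only): class 2 plays the role of the dominant 'adult' class.
toy_labels = np.array([2] * 8 + [0] * 2 + [1] + [3])
toy_images = np.arange(toy_labels.size).reshape(-1, 1)
bal_images, bal_labels = oversample_minority(toy_images, toy_labels, np.random.RandomState(0))
print(np.bincount(bal_labels))
```

In practice the duplicated indices would be passed through the augmentation step so that the extra minority-class examples are not exact copies.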

In every case, though, the network easily reached 100% accuracy on the training data. So perhaps it is simply highly prone to overtraining, and the test accuracy reflects how similar the testing data is to the training data. A network that has memorised the training data for the Age task will have a heavy bias toward predicting 'adult' for any given image. If we also suppose that a significant proportion of the testing images look sufficiently similar to one of the training images to get the correct label just by finding the closest match (rather than by learning semantic features relevant to the task), then this plausibly accounts for the apparently high accuracy of 91.73%.

I made one last attempt to improve the performance of the classifiers: I introduced dropout at each max-pooling layer (this can be seen at [17]). Dropout is a method where, during training, the activations of some proportion of the neurons in a layer are set to zero (selected at random for each batch). The network therefore can't rely on every feature being available, so it is forced to learn general features that apply to more images (in other words, it discourages overtraining). I tried dropout rates of 0.9 (only 10% of units remain in play), 0.5, and 0.25, but none of them helped: they all slowed training down significantly, and the peak accuracy of the networks was much lower than without dropout.
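To make the mechanism concrete, here is a minimal NumPy sketch of dropout on a block of activations. It uses the common "inverted" form, in which surviving activations are scaled by 1/keep probability so the expected activation is unchanged; that scaling, and the helper name `dropout`, are assumptions of this sketch rather than a description of the code at [17].

```python
import numpy as np

def dropout(activations, drop_rate, rng, training=True):
    """Zero a random fraction `drop_rate` of activations and rescale the survivors."""
    if not training or drop_rate == 0.0:
        return activations                       # dropout is switched off at test time
    keep_prob = 1.0 - drop_rate
    mask = rng.uniform(size=activations.shape) < keep_prob   # True where a unit survives
    return activations * mask / keep_prob        # rescale so the expected value is unchanged

rng = np.random.RandomState(0)
acts = np.ones((1000, 100))
dropped = dropout(acts, drop_rate=0.9, rng=rng)
print("fraction of units kept:", np.count_nonzero(dropped) / dropped.size)
print("mean activation after dropout:", dropped.mean())
```

With a drop rate of 0.9 only about one unit in ten carries any signal in a given batch, which may be part of why training slowed so much at that setting.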

## References

[1] A. Krizhevsky, I. Sutskever, and G. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS, 1097--1105, 2012.

[2] K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR, 2015.

[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going Deeper with Convolutions." CoRR, 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun. "Deep Residual Learning for Image Recognition." CoRR, 2015.

[5] W. S. McCulloch and W. Pitts. "A Logical Calculus of the Ideas Immanent in Nervous Activity." Bulletin of Mathematical Biophysics, 5: 115--133, 1943.

[6] D. O. Hebb. "The Organisation of Behaviour: a Neuropsychological Theory." John Wiley & Sons, NY, 1949.

[7] F. Rosenblatt. "Principles of Neurodynamics." Spartan, NY, 1962.

[8] M. Minsky and S. Papert. "Perceptrons." MIT Press, Cambridge, MA, 1969.

[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation." In D. E. Rumelhart, J. L. McClelland, The PDP Research Group (eds.) "Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations." MIT Press, Cambridge, MA, 1986.

[10] Y. LeCun, Y. Bengio, and G. Hinton. "Deep Learning." Nature, 521(7553): 436--444, 2015.

[11] Z. Qawaqneh, A. A. Mallouh, and B. D. Barkana. "Deep Convolutional Neural Network for Age Estimation based on VGG-Face Model." CoRR, 2017.

[12] O. Arriaga, M. Valdenegro-Toro, and P. Plöger. "Real-time Convolutional Neural Networks for Emotion and Gender Classification." CoRR, 2017.

[13] https://github.com/hannahcy/face-recognition/blob/master/trainer_resnet.py

[14] https://medium.com/@thimblot/data-augmentation-boost-your-image-dataset-with-few-lines-of-python-155c2dc1baec

[15] M. Lin, Q. Chen, and S. Yan. "Network In Network." CoRR, 2013.

[16] D. P. Kingma and J. Ba. "Adam: A Method for Stochastic Optimization." ICLR, 2015.

[17] https://github.com/hannahcy/face-recognition/blob/master/trainer_vgg.py