%% REPLACE sXXXXXXX with your student number
\def\studentNumber{s1803764}
%% START of YOUR ANSWERS
%% Add answers to the questions below, by replacing the text inside the brackets {} for \youranswer{ "Text to be replaced with your answer." }.
%
% Do not delete the commands for adding figures and tables. Instead fill in the missing values with your experiment results, and replace the images with your own respective figures.
%
% You can generally delete the placeholder text, such as for example the text "Question Figure 2 - Replace the images ..."
%
% There are 19 TEXT QUESTIONS (a few of the short first ones have their answers added to both the Introduction and the Abstract). Replace the text inside the brackets of the command \youranswer with your answer to the question.
%
% There are also 3 "questions" to replace some placeholder FIGURES with your own, and 3 "questions" asking you to fill in the missing entries in the TABLES provided.
%
% NOTE! that questions are ordered by the order of appearance of their answers in the text, and not by the order you should tackle them. Specifically, you cannot answer Questions 2, 3, and 4 before concluding all of the relevant experiments and analysis. Similarly, you should fill in the TABLES and FIGURES before discussing the results presented there.
%
% NOTE! If for some reason you do not manage to produce results for some FIGURES and TABLES, then you can get partial marks by discussing your expectations of the results in the relevant TEXT QUESTIONS (for example Question 8 makes use of Table 1 and Figure 2).
%
% Please refer to the coursework specification for more details.
%% - - - - - - - - - - - - TEXT QUESTIONS - - - - - - - - - - - -
%% Question 1: define overfitting
\newcommand{\questionOne} {
\youranswer{when a model learns the "noise" in the training data, so that it does not generalise well to new, unseen data}
}
%% Question 2: Summarise the effect increasing width and depth of the architecture had on overfitting
\newcommand{\questionTwo} {
\youranswer{improves the validation accuracy but worsens the validation error. This indicates that as we increase the width/depth of our model it begins to fit the "noise" in the training data, which implies that we should stop training earlier (when the validation accuracy is at its maximum) and apply regularisation techniques to reduce the model's complexity and thereby limit the amount of "noise" retained in the network}
}
%helps until the number of parameters becomes too high so that each layer just memorizes the training data, and you can end up with a neural network that fails to generalize well to new unseen data
%% Question 3: Summarise what your results show you about the effect of the tested approaches on overfitting and the performance of the trained model
\newcommand{\questionThree} {
\youranswer{%dropout was the best technique as it produced the best performing model with 84.45\% validation accuracy at a probability of 0.9. Out of the weight penalty techniques L1 regularisation produced the best model with 84.22\% validation accuracy using a coefficient of 1e-4.
dropout is the most effective regularisation technique for mitigating overfitting, as it does more than simply penalise complexity and actually adds robustness to the network. This is due to its use of deactivated neurons (which are ignored in the forward and backward propagations), which effectively trains an ensemble of different sub-networks (since deactivating neurons randomly changes the effective dimensions of each layer) and averages over them. Although the L1 and L2 regularisation techniques did not produce a validation accuracy as good as that of Dropout, they did produce the best generalisation gaps and came very close to the performance produced by Dropout.
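As a brief reminder of the mechanism (stated here for the common "inverted dropout" formulation; the exact scaling convention may differ in the framework used): during training each activation $h_j$ is multiplied by a mask $m_j \sim \text{Bernoulli}(p)$ and rescaled, $\tilde{h}_j = \frac{m_j}{p} h_j$, so that $\mathbb{E}[\tilde{h}_j] = h_j$ and no rescaling of the weights is needed at test time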
%as it forces nodes within a layer to probabilistically take on more or less responsibility for the inputs.s, for all combinations of hidden units and layers dropout always improved performance.
}
}
%% Question 4: Give your overall conclusions
\newcommand{\questionFour} {
\youranswer{
the Maxout unit is useful as a means to enhance Dropout's ability as a model averaging technique; however, it requires more trainable parameters than traditional activation functions and does not perform as well as the ReLU unit when the number of convolutional filters is increased.
We also found that for the EMNIST dataset the Dropout, L1 regularisation, and L2 regularisation techniques are all very useful for mitigating overfitting. From our Dropout experiments we found that for $0.7 \leqslant p \leqslant 0.95$ (where $p$ is the Dropout probability) both validation accuracy and generalisation gap increased with $p$, which was optimal at a value of $0.95$. From our L1/L2 regularisation experiments we found that for $10^{-4} \leqslant \beta \leqslant 10^{-1}$ (where $\beta$ is the regularisation parameter) both generalisation gap and validation accuracy decreased as $\beta$ increased, that L2 regularisation produced better performing models, and that L1 regularisation produced models with lower generalisation gaps.
From our combined Dropout and L1/L2 regularisation experiments we found that for $p = 0.95$ and $10^{-6} \leqslant \beta \leqslant 10^{-4}$ L2 regularisation was the better weight penalty technique, given it produced the best validation accuracies across all values of $\beta$; $\beta = 10^{-5}$ was the optimal parameter setting in this context for both L1 and L2 regularisation, and thus the parameters of the best performing model were $\beta = 10^{-5}$ for L2 regularisation and $p = 0.95$ for Dropout.
In the future, to make our models perform even better, there are a number of things we could still optimise further: the number of epochs used, the batch size, the depth and width of our model, the type of activation function used, the structure of the Dropout layers in our architecture, the L1/L2 regularisation parameters, and the random seed used to initialise our weights and Dropout layers
}
}
%% Question 5: Explain what overfitting is in detail and in your own words
\newcommand{\questionFive} {
\youranswer{it is trained for too long, resulting in an increasingly large generalisation gap (the gap between the validation error and the training error) and a decreasing validation accuracy. These trends indicate that the model is starting to learn "noise" in the training data, which hinders its ability to generalise well to new unseen data such as the validation set}
}
%% Question 6: Discuss ``why'' and ``how'' overfitting occurs, and ``how'' one can identify it is happening
\newcommand{\questionSix} {
\youranswer{Overfitting occurs due to the way we train and optimise our ML models. To optimise a model we define an error function that measures how far the model's predictions are from the true labels, so a model with perfect accuracy would produce zero error. To minimise this error we adjust the weights of the model so as to move towards a local/global minimum of the error function (where its derivative with respect to the weights is zero). The problem is that this error function is defined on the training data, which inevitably contains some "noise" for various reasons (data collection errors due to inaccurate/miscalibrated sensors or varying conditions, noise added by computer processing such as floating point round-off errors, etc.), and so as the training error approaches zero we start to fit this "noise" into our model. We must therefore be careful about when we stop training in order to prevent fitting this "noise". We can identify that our model has started to overfit when the generalisation gap starts to increase and the validation accuracy starts to decrease, as this indicates that further training is actually worsening the performance of the model.
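One simple way to make this precise: writing $E_{train}(t)$ and $E_{valid}(t)$ for the training and validation errors after epoch $t$, the generalisation gap is $G(t) = E_{valid}(t) - E_{train}(t)$, and overfitting is signalled by $G(t)$ growing while $E_{valid}(t)$ itself begins to rise; a natural early-stopping rule is then to keep the parameters from the epoch $t^{*} = \arg\min_t E_{valid}(t)$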
}
}
%% Question 7: Explain what these figures contain and how the curves evolve, and spot where overfitting occurs. Reason based on the min/max points and velocities (direction and magnitude of change) of the accuracy and error curves
\newcommand{\questionSeven} {
\youranswer{
the training and validation accuracies both improve roughly logarithmically with the number of epochs; however, after epoch 17 we can see that the validation accuracy begins to decrease slowly with the number of epochs. This indicates that the model is starting to overfit to the "noise" in the training set after epoch 17, as the validation accuracy starts to decrease and the generalisation gap starts to increase with the number of epochs.
This is further supported by figure~\ref{fig:example_errorcurves}, which shows the validation and training errors as we increase the number of epochs. We can see that after epoch 17 the validation error starts to slowly increase in a roughly linear fashion, and the generalisation gap ($E_{valid} - E_{train}$) widens steadily (since the training error continues to decrease while the validation error continues to increase).
%, thus if we were given this figure when choosing the optimal model we would choose the model at epoch 17 as this gives the best validation set performance. We see that figure 1b appears like figure 1a flipped upside down, this is to be expected given these graphs are reporting the opposite metrics.
}
}
%% Question 8: Explain your network width experiment results by using the relevant figure and table
\newcommand{\questionEight} {
\youranswer{as given by table~\ref{tab:width_exp}, the widest model (128 units) performed the best given it produced the highest validation accuracy; however, it was also the most overfitted given it produced the largest generalisation gap.
This seems to be the general trend for these width experiments: as we increase the width of our model, our validation accuracy and generalisation gap both increase.
This is evident from both figure~\ref{fig:width_acccurves} and figure~\ref{fig:width_errorcurves}. In figure~\ref{fig:width_acccurves} we can see that the validation accuracy of the model with 128 units peaks at epoch 20, after which it decreases slowly and converges with the validation accuracies of the narrower models. In figure~\ref{fig:width_errorcurves} we can see that the validation error of the model with 128 units reaches its minimum at epoch 9, after which it increases slowly while the training error continues to decrease with the number of epochs, indicating a widening of the generalisation gap.
From these figures we can also see that the narrower models do not overfit as quickly as the wider ones due to their lower complexity. This lower complexity means the narrower networks cannot store as much information as the wider networks, making it harder for them to fit the training data and thereby retain the insignificant features/"noise" it contains. This is evident in figure~\ref{fig:width_acccurves}, as the model with 32 units continues to slowly increase in validation accuracy all the way up to epoch 94 (where it reaches its maximum validation accuracy), in contrast to the 128 unit model which achieved its maximum validation accuracy at epoch 20, and the 64 unit model which achieved its maximum validation accuracy at epoch 29. These large differences between the optimal epoch numbers for each model are to be expected given the exponential (doubling) differences in the widths of these models ($32 = 2^5$, $64 = 2^6$, and $128 = 2^7$)
%The large differences in validation accuracies seen in figure \ref{fig:depth_acccurves} at around epoch 20 are a direct indication of how increasing the width of a model can increase it's performance.
%Given the results from table~\ref{tab:width_exp} alone, if I had to choose one of these models (that was trained for 100 epochs) for implementation I would choose the model with 64 units width. Although it has a slightly worse validation accuracy than that of the 128 unit model it has a far better generalization gap implying it will be a better model for generalizing to new unseen data
%the width of our model is directly proportional to validation accuracy until a certain epoch at which the model begins to become overfitted (given the validation accuracy starts to decrease and the generalization gap starts to increase). This is evident from figure 2a in which you can see that the validation accuracy of the model with 128 units peaks at epoch 20 and after this it decreases slowly and converges with the validation accuracies of the thinner models. We can see that these thinner models do not overfit as quickly as this one due to their lower complexity, this is evident in the graph as the model with 32 units continues to slowly increase in validation accuracy all the way up to epoch 94 (where it reaches it's maximum validation accuracy). From table 1 we can see that the model with 128 units is overfitted due it's near identical validation accuracy with the 64 unit model yet far larger generalization gap
}
}
%% Question 9: Discuss whether varying width affects the results in a consistent way, and whether the results are expected and match well with the prior knowledge (by which we mean your expectations as are formed from the relevant Theory and literature)
\newcommand{\questionNine} {
\youranswer{
%From these results it is evident that varying widths in this neural network architecture affected the results in a consistent way. We know that the larger the width of a model implies a larger complexity and thereby a higher susceptibility to overfitting. Thus as we increased the width of the models they became more susceptible to overfitting and so achieved their peak validation accuracies at much lower epochs. Thus table 1 is rather misleading in the quality of varying widths, as all these metrics were scored after training each model for 100 epochs rather than training each model for their optimal number of epochs.
From these results it is evident that as we increase the width of our model, the validation accuracy increases, the classification error decreases, and the number of epochs needed to train the model decreases. However, we must note that this observation only holds when we stop training the model before it becomes overfitted. We can define this point of overfitting to be the epoch at which the given model reaches its optimum validation accuracy, as this indicates the model is best at generalising to new unseen data. This overfitting is due to the fact that as we increase the number of training epochs our error function comes closer and closer to a local minimum, but this local minimum is defined by the training samples in our dataset, and thus as we approach it our network starts to fit noise from the training data, preventing it from generalising well. This overfitting is evident across all our models here, as they all reach their maximum validation accuracy scores before epoch 100, and after the epoch where this maximum validation accuracy is achieved we can see that for each model the validation accuracy starts to slowly decrease and the classification error starts to slowly increase.
These results were to be expected, as a wider model can store more information in the network, allowing it to make more informed predictions. Furthermore, being able to store more information also makes the model more susceptible to overfitting: if it is trained for too long the network can end up storing the "noise" from the training data and thus fail to generalise well to new unseen data, which explains why the wider models overfit more quickly than the narrower ones
}
}
%% Question 10: Explain your network depth experiment results by using the relevant figure and table
\newcommand{\questionTen} {
\youranswer{
as given by table~\ref{tab:depth_exps}, the deepest model (3 layers) performed the best given it produced the highest validation accuracy; however, it was also the most overfitted given it produced the largest generalisation gap.
This seems to be the general trend for these depth experiments: as we increase the depth of our model, our validation accuracy and generalisation gap both increase.
This is evident from both figure~\ref{fig:depth_acccurves} and figure~\ref{fig:depth_errorcurves}. In figure~\ref{fig:depth_acccurves} we can see that the validation accuracy of the model with 3 layers peaks at epoch 11, after which it decreases slowly and converges with the validation accuracies of the shallower models. In figure~\ref{fig:depth_errorcurves} we can see that the validation error of the model with 3 layers reaches its minimum at epoch 7, after which it increases slowly while the training error continues to decrease with the number of epochs, indicating a widening of the generalisation gap.
From these figures we can also see that the shallower models do not overfit as quickly as the deeper ones due to their lower complexity. This lower complexity means the shallower networks cannot store as much information as the deeper networks, making it harder for them to fit the training data and thereby retain the insignificant features/"noise" it contains. This is evident in figure~\ref{fig:depth_acccurves}, as the model with 1 layer continues to slowly increase in validation accuracy all the way up to epoch 20 (where it reaches its maximum validation accuracy), in contrast to the 3 layer model which achieved its maximum validation accuracy at epoch 11, and the 2 layer model which achieved its maximum validation accuracy at epoch 17. These roughly linear differences between the optimal epoch numbers for each model are to be expected given the linear differences in the depths of these models
%as given by table~\ref{tab:depth_exps} the deepest model (3 layers) performed the best given it produced the best validation accuracy, however, it was also the most overfitted given it produced the largest generalization gap.
%This seems to be the general trend from these depth experiments, that as we increase the depth of our model, our validation accuracy and generalization gap both increase.
%This is evident from figure~\ref{fig:depth_acccurves} in which you can see that the validation accuracy of the model with 3 layers peaks at epoch 11 and after this it decreases slowly and converges with the validation accuracies of the thinner models. We can see that these shallower models do not overfit as quickly as this one due to their lower complexity, this is evident in the graph as the model with 1 layer continues to slowly increase in validation accuracy all the way up to epoch 20 (where it reaches it's maximum validation accuracy).
%Given the results from table~\ref{tab:depth_exps} alone, if I had to choose one of these models (that was trained for 100 epochs) for implementation I would choose the model with 3 layers. Although it has the worst generalization gap this model achieved the best validation accuracy performing almost 1\% better than the model with 2 layers
}
}
%% Question 11: Discuss whether varying depth affects the results in a consistent way, and whether the results are expected and match well with the prior knowledge (by which we mean your expectations as are formed from the relevant Theory and literature)
\newcommand{\questionEleven} {
\youranswer{From these results it is evident that as we increase the depth of our model, the validation accuracy increases, the classification error decreases, and the number of epochs needed to train the model decreases. However, we must note that this observation only holds when we stop training the model before it becomes overfitted. We can define this point of overfitting to be the epoch at which the given model reaches its optimum validation accuracy, as this indicates the model is best at generalising to new unseen data. This overfitting is due to the fact that as we increase the number of training epochs our error function comes closer and closer to a local minimum, but this local minimum is defined by the training samples in our dataset, and thus as we approach it our network starts to fit noise from the training data, preventing it from generalising well. This overfitting is evident across all our models here, as they all reach their maximum validation accuracy scores long before epoch 100, and after the epoch where this maximum validation accuracy is achieved we can see that for each model the validation accuracy starts to slowly decrease and the classification error starts to slowly increase.
These results were to be expected, as a deeper model can store more information in the network, allowing it to make more informed predictions. Furthermore, being able to store more information also makes the model more susceptible to overfitting: if it is trained for too long the network can end up storing the "noise" from the training data and thus fail to generalise well, which explains why the deeper models overfit more quickly than the shallower ones
}
}
%% Question 12: Compare and discuss how varying width and height changes the performance and overfitting in your experiments
\newcommand{\questionTwelve} {
\youranswer{
From table~\ref{tab:width_exp} and table~\ref{tab:depth_exps} we can see that increasing depth is the more effective means of increasing performance, as it achieves better validation accuracy across all models (when looking at the optimal validation accuracy over 100 epochs). We can also see from these tables that increasing width makes the model less susceptible to overfitting than increasing the depth, given the far smaller generalisation gaps achieved across all the width experiment models. When looking at figure~\ref{fig:width_acccurves} and figure~\ref{fig:depth_acccurves} we can see that there is a far larger gap in accuracies between the models of varying width, which indicates the rapid improvement that increasing width can have on a model's performance.
Finally, when comparing these two ways of scaling neural networks we must also keep in mind their computational costs, as different applications may demand different amounts of computational efficiency. In a time-critical application (where computation time needs to be minimised as much as possible) it is more computationally efficient to widen the network than to deepen it, as wider networks allow many multiplications to be completed in parallel, unlike deeper networks which require more sequential operations (since the computations depend on the outputs of the previous layers) \cite{DBLP:journals/corr/ZagoruykoK16}
}
}
%% Question 13: Explain L1/L2 weight penalties first in words and then with formulas. Explain how they are incorporated to training and what hyperparameter(s) they require
\newcommand{\questionThirteen} {
\youranswer{
These techniques work by adding a regularisation term to the error function which acts as a penalty for complex models with large weights. This means that when we train our model using gradient descent, the error function gravitates towards a local minimum that does not take "noisy"/insignificant features into account. Without regularisation, the error for training example $n$ is simply the training term:
$$ E^n = E^{n}_{train}$$
\\
\\
We incorporate L1 regularisation into training our model by adding a regularisation term to this error function that penalizes large weights:
%L1 regularisation is the preferred choice when having a high number of features as it provides sparse solutions. In L1 the weights shrink to 0 at a constant rate ($\beta \text{sgn}(w_i)$), thus the new mathematical formula for the loss function with L1 regularization is as follows:
$$ E^n = E^{n}_{train} + \beta \sum_{i}|w_i|$$
$$ \text{where } \beta \text{ is the regularisation parameter and the sum runs over all weights } w_i$$
We can then move towards the \emph{new} local/global minimum of our L1 error function by following its gradient with respect to each weight:
$$ \frac{\partial E^n}{\partial w_i} = \frac{\partial E_{train}^{n}}{\partial w_i} + \beta \text{sgn}(w_i)$$
$$\text{where } \text{sgn}(w_i) \text{ is the sign of }w_i$$
\\
\\
%This added regularisation term evidently indicates that L1 regularisation penalizes the weights of a model by calculating the sum of the absolute value of the weights
%L2 regularisation can deal with multicollinearity (independent variables that are highly correlated) problems through constricting the coefficient and by keeping all the variables. L2 regression can be used to estimate the significance of predictors and based on that it can penalize the insignificant predictors. In L2 weights shrink to 0 at a rate proportional to the size of the weight ($ \beta w_i$),
We incorporate L2 regularisation into training our model by adding a regularisation term to this error function that penalises large weights (the factor of $\frac{1}{2}$ is included so that the gradient below takes a simple form):
$$ E^n = E^{n}_{train} + \frac{\beta}{2} \sum_{i}w_i^2 $$
$$ \text{where } \beta \text{ is the regularisation parameter} $$
We can then move towards the \emph{new} local/global minimum of our L2 error function by following its gradient with respect to each weight:
$$ \frac{\partial E^n}{\partial w_i} = \frac{\partial E_{train}^{n}}{\partial w_i} + \beta w_i$$
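For completeness, the plain gradient-descent weight updates implied by these gradients (with learning rate $\eta$) are
$$ w_i \leftarrow w_i - \eta \left( \frac{\partial E_{train}^{n}}{\partial w_i} + \beta \, \text{sgn}(w_i) \right) \quad \text{(L1)}$$
$$ w_i \leftarrow w_i - \eta \left( \frac{\partial E_{train}^{n}}{\partial w_i} + \beta w_i \right) \quad \text{(L2)}$$
so the L1 term shrinks each weight by a constant amount per step, while the L2 term (often called weight decay) shrinks each weight by an amount proportional to its current value.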
\\
This regularisation parameter $\beta$ is evidently extremely important to both of these regularisation techniques, as it directly affects the complexity and training-data fit of the model. Given this parameter is handpicked, it is important that it is set well. The choice of this parameter depends entirely on how far we want to simplify our model (increasing its magnitude strengthens the regularisation effect), so we must keep the initial performance and fit of our model in mind when choosing it. As shown in table~\ref{tab:hp_search}, a good method for choosing this parameter is to evaluate a range of $\beta$ values on the validation set and choose the one that produces the best validation accuracy
}
}
%% Question 14: Discuss how/why the weight penalties may address overfitting, discuss how L1 and L2 regularization differ and support your claims with references where possible
\newcommand{\questionFourteen} {
\youranswer{
These regularisation techniques address overfitting by penalising large weights. Penalising large weights helps to reduce the complexity of the model, so that less noise from the training set is retained. The main difference between L1 and L2 regularisation is the regularisation term each adds to the error function. Under L1 regularisation weights shrink towards 0 at a constant rate ($\beta \, \text{sgn}(w_i)$), in contrast to L2 regularisation where weights shrink towards 0 at a rate proportional to the magnitude of the weight ($\beta w_i$). This difference arises because L1 regularisation penalises the sum of the absolute values of the weights whereas L2 regularisation penalises the sum of the squared weights. These different penalty terms give each method different strengths for different types of data. L1 regularisation is particularly useful for feature selection, as it produces a sparse solution and allows us to drop features based on the weights that go to 0 \cite{neelam_tyagi_2021}. L2 regularisation, on the other hand, produces non-sparse solutions and is useful when you have collinear/codependent features in your dataset. This is because codependence tends to increase weight variance, which makes the weights unreliable/unstable and can end up hurting the model's generality; L2 fixes this by reducing the variance of these estimates, which counteracts the effect of the codependencies \cite{explained_regularization_2021}.
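As a small numerical illustration of this difference (with hypothetical values): taking $\beta = 10^{-3}$, a weight of magnitude $1$ and a weight of magnitude $0.01$ both receive the same L1 penalty gradient of $10^{-3}$, whereas under L2 their penalty gradients are $10^{-3}$ and $10^{-5}$ respectively; the L1 penalty therefore keeps pushing small weights all the way to exactly zero (producing sparsity), while the L2 penalty merely shrinks them towards zero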
}
}
%% Question 15: Explain the experimental details (e.g. hyperparameters), discuss the results in terms of their generalization performance and overfitting
\newcommand{\questionFifteen} {
\youranswer{
From our Dropout experiments it is evident that validation accuracy and generalisation gap both increase with the Dropout (inclusion) probability for $0.7 \leqslant p \leqslant 0.95$ (where $p$ is the Dropout probability). This was achieved by applying a Dropout layer before each Affine layer in our network (resulting in 4 Dropout layers), and using the same Dropout probability for each Dropout layer. These results were to be expected, given that a higher inclusion probability means fewer nodes are dropped and thus more information is retained in the network, giving higher validation accuracy; but since we are retaining more information this also makes the model more susceptible to overfitting, thus widening the generalisation gap. However, these results do not cover any cases where $0.5 \leqslant p < 0.7$; this is problematic as we cannot extrapolate the trend found from our 3 experiments to the performance of Dropout layers with $0.5 \leqslant p < 0.7$. Such results would have been useful for getting better insight when choosing the Dropout probability for our combination experiments.
From our L1/L2 regularisation experiments it is evident that L1 regularisation is less susceptible to overfitting than L2 regularisation, based on the far lower generalisation gaps this method achieved, as shown in table~\ref{tab:hp_search}. From these results we can also see that validation accuracy and generalisation gap both decrease as the L1 regularisation parameter increases. The L2 regularisation parameter seemed to follow the same general trend except for the validation accuracy, where the parameter setting of 1e-4 gave a lower validation accuracy than the setting of 1e-3. However, we must be careful before making any deductions, as this parameter producing a better network for the validation set could just be coincidence given there was only a 0.6\% difference in accuracy between these hyperparameter settings. The overall trend is to be expected, given that a smaller regularisation parameter implies a smaller penalty applied to large weights, meaning more information is retained in the network, so the validation accuracy continues to improve until the model reaches a point of overfitting; but since we are retaining more information this also makes the model progressively more susceptible to overfitting, thus widening the generalisation gap.
For my combined regularisation method experiments I chose the hyperparameters carefully. I wanted to use the findings from my individual regularisation experiments to inform which parameters would be best for Dropout and for the L1/L2 weight penalties. However, when choosing these hyperparameter settings I also wanted to cover a representative range of possible values, because although we know which parameters seem to perform best when these regularisation techniques are used by themselves, we cannot be sure how they will behave when used in conjunction.
From the Dropout experiments it was evident that setting the Dropout probability to 0.95 was the optimal choice since it gave the highest validation accuracy of 86.15\%. Thus for all my combination experiments I decided to keep the Dropout probability fixed at 0.95 to maximise validation accuracy. Although this parameter setting produced the largest generalisation gap we can be less worried about this given we will be using it in conjunction with L1/L2 regularisation which should help mitigate this overfitting even further.
From the results of our L1/L2 regularisation experiments as shown in table~\ref{tab:hp_search} L1 seems like the more sensible choice. This is due to the fact that it achieved the highest validation accuracy from these experiments at 85.25\% for a parameter setting of 1e-4, and achieved far smaller generalisation gaps than L2 for the same parameter settings. However, the differences in validation accuracy were small and I wanted to be representative in my combination experiments so I decided to use both techniques.
For each of these regularisation techniques, smaller parameter values generally gave better accuracy, so in addition to 1e-4 (which achieved the best score on average across the L1 and L2 experiments) I decided to use even smaller regularisation parameters for the combination experiments (1e-5 and 1e-6).
Thus as shown in table~\ref{tab:hp_search} the hyperparameter combinations I used were as follows:
\begin{itemize}
\setlength\itemsep{0em}
\item Dropout probability $= 0.95$
\item Weight penalty technique $\in \{\text{L1}, \text{L2}\}$
\item L1/L2 regularisation parameter $\in \{10^{-4}, 10^{-5}, 10^{-6}\}$
\end{itemize}
After running all these experiments my findings proved very useful. As shown in figure~\ref{fig:extra}, L2 regularisation proved to be the superior weight penalty technique to L1, given it achieved higher validation accuracies across all the weight decay settings tested. Although it produced worse generalisation gaps than L1 for all weight decay settings, this was to be expected based on the generalisation gap results from the standalone L1 and L2 experiments. In addition, again as shown in figure~\ref{fig:extra}, 1e-5 proved to be the optimal regularisation parameter setting, given it produced the best validation accuracies for both L1 and L2 regularisation.
Thus, in line with the findings discussed above, my best model ended up using L2 regularisation with parameter setting 1e-5 and Dropout with probability 0.95. This model achieved 86.52\% validation accuracy and 84.87\% test accuracy
}
}
%% Question 16: Explain the motivation behind Maxout Networks as presented in \cite{goodfellow2013maxout}
\newcommand{\questionSixteen} {
\youranswer{
although dropout is generally seen as an indiscriminately applicable tool that reliably yields an improvement when applied to almost any model, it had not previously been demonstrated to actually perform model averaging for deep architectures, and its model averaging is only approximate.
Maxout networks are designed to both facilitate optimisation by dropout and improve the accuracy of dropout's fast approximate model averaging technique.
The maxout network is a feed-forward architecture that uses a new type of activation function called the maxout unit. Given an input a maxout hidden layer implements the following function:
$$ h_i(x) = \max_{j \in [1,k]} z_{ij} $$
$$\text{where } z_{ij}=x^T W_{\cdots ij} + b_{ij} \text{, and } W \in \mathbb{R}^{d \times m \times k} \text{ and } b \in \mathbb{R}^{m \times k}$$
$$\text{ are learned parameters}$$
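For instance (a small illustrative case), with $k = 2$ a single maxout unit computes $h_i(x) = \max(w_1^T x + b_1,\; w_2^T x + b_2)$; choosing $w_2 = 0$ and $b_2 = 0$ recovers a ReLU applied to $w_1^T x + b_1$, so the ReLU is one of the many activation functions a maxout unit can learn.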
A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function.
Maxout networks are useful because they learn not just the relationships between hidden units, but also the activation function of each hidden unit. Maxout units are particularly powerful in that a network containing just 2 maxout units can approximate any continuous function arbitrarily well on a compact domain, making it a universal approximator
}
}
%% Question 17: State whether Dropout is compatible (can be used together) with Maxout and explain why
\newcommand{\questionSeventeen} {
\youranswer{
We know that Dropout is compatible with Maxout, given that Maxout was developed specifically as a method to enhance Dropout's abilities as a model averaging technique. This is further supported by the original paper on Maxout networks \cite{goodfellow2013maxout}, which used Maxout in conjunction with Dropout to evaluate Maxout's performance on benchmark datasets (including MNIST)
}
}
%% Question 18: Give an overview of the experiment setup in \cite{goodfellow2013maxout} and analyse it from the point of view of how convincing their conclusions are
\newcommand{\questionEighteen} {
\youranswer{
The experimental setup in this paper was good in that the authors tested their novel method on benchmark datasets, making it comparable to the performance of existing neural networks that use different regularisation methods and/or activation units. Using formal proofs to show that Maxout is a universal approximator was also a strong point, as it provides solid theoretical evidence for this claim rather than relying on empirical results alone.
However, there were a few areas of this paper which were not exhaustively covered and which add some ambiguity about the quality of the method. Firstly, I believe they should have provided error scores for models fitted with Dropout alone and with Maxout alone; although they stated that Maxout is "particularly well suited for training with dropout", it would still be useful to compare the performance of these techniques independently. Secondly, given that a network with Maxout activation units has a higher number of trainable parameters than one with traditional activation functions, it is unclear whether Maxout's improved performance was mainly or solely due to the increase in the number of trainable parameters. Thirdly, the conclusion of the paper states: "We have shown empirical evidence that dropout attains a good approximation to model averaging in deep models. We have shown that maxout exploits this model averaging behavior because the approximation is more accurate for maxout units than for tanh units". Although tanh is a reliable activation unit, this does not mean it is the optimal activation unit in every context; it would have been better to test against more types of activation unit to make Maxout's performance more comparable. This ambiguity actually resulted in a later paper investigating the quality of Maxout in comparison to several other activation units (ReLU, leaky ReLU, SELU, and tanh) \cite{castaneda_morris_khoshgoftaar_2019}. That paper found that Maxout networks trained relatively slowly compared to networks with traditional activation functions (which can be attributed to the increase in trainable parameters), and that the ReLU activation function ended up performing better than any Maxout function when the number of convolutional filters was increased
}
}
%% Question 19: Briefly draw your conclusions based on the results from the previous sections (what are the take-away messages?) and conclude your report with a recommendation for future directions
\newcommand{\questionNineteen} {
\youranswer{
From these results it is evident that for the EMNIST dataset the Dropout, L1 regularisation, and L2 regularisation techniques are all very useful for mitigating overfitting. From our Dropout experiments (as shown in figure~\ref{fig:dropoutrates}) we found that for $0.7 \leqslant p \leqslant 0.95$ (where $p$ is the Dropout probability) both validation accuracy and generalisation gap increased with $p$, which was optimal at a value of $0.95$. From our L1/L2 regularisation experiments (as shown in figure~\ref{fig:weightrates}) we found that for $10^{-4} \leqslant \beta \leqslant 10^{-1}$ (where $\beta$ is the L1/L2 regularisation parameter) both generalisation gap and validation accuracy decreased as $\beta$ increased, that L2 regularisation produced better performing models, and that L1 regularisation produced models with lower generalisation gaps.
From our combined Dropout and L1/L2 regularisation experiments (as shown in figure~\ref{fig:hp_search}) we found that for $p = 0.95$ and $10^{-6} \leqslant \beta \leqslant 10^{-4}$ L2 regularisation was the better weight penalty technique, given it produced the best validation accuracies across all values of $\beta$; $\beta = 10^{-5}$ was the optimal parameter setting in this context for both L1 and L2 regularisation, and thus the parameters of the best performing model were $\beta = 10^{-5}$ for L2 regularisation and $p = 0.95$ for Dropout.
In the future, to make our models perform even better, there are a number of things we could still optimise further: the number of epochs used, the batch size, the depth and width of our model, the type of activation function used, the structure of the Dropout layers in our architecture, the L1/L2 regularisation parameters, and the random seed used to initialise our weights and Dropout layers.
The first step I would take towards optimising my models would be through optimising the number of epochs used. Throughout all our experiments we set the number of epochs to be constant at 100. This is evidently not a realistic setting to use in implementation as we would rather choose the number of epochs dynamically based on what achieves the best validation accuracy.
Secondly, I would optimise the batch size used by gradient descent in my model. Throughout all our experiments we have kept the batch size fixed at 100, however, we should rather set this dynamically based on what setting enables our model to perform best. Changing this batch size could dramatically improve our models ability to generalise: "it has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize." - \cite{DBLP:journals/corr/KeskarMNST16}
Thirdly, to optimise the shape (width and depth) of our model I would mainly focus on increasing the width, as our experiments from figure~\ref{fig:depth} indicate that increasing the depth has minimal performance returns after 2 layers, given the model with 3 layers achieved an optimal validation accuracy of 84.41\% (a 0.01\% performance decrease from the 2 layer model). In contrast, we can see in figure~\ref{fig:width} that the performance returns from increasing the width are much greater (as illustrated by the difference between the optimal validation accuracies achieved by each width of model), given the model with 128 units achieved an optimal validation accuracy of 83.48\% (a 1.47\% performance increase over the 64 unit model). However, we must note that this performance increase required a doubling of the number of hidden units ($2^6 \rightarrow 2^7$), and thus any further significant increases in performance may require further doublings of the number of hidden units and thus exponentially increasing computation.
Next, to optimise the activation unit used I would test various types of activation unit and choose the one which performs best. In particular, I would like to test the Maxout unit in conjunction with Dropout, to enhance Dropout's abilities as a model averaging technique.
After this, to optimise our Dropout layers even further we could optimise the number of Dropout layers used, their placement within the network (before which Affine layers), and the individual probability used for each of these layers. I would start off by setting $p$ in the input layer to 0.95 (based on our results), setting it between 0.5 and 0.8 in the hidden layers, and using no Dropout on the output layer \cite{machinelearningmastery_2018}. From here I would then optimise each of these layers' parameters by training multiple models while varying these parameter values. Once I had a sufficient number of models with varying Dropout layer probabilities I would plot all this data as in figure~\ref{fig:dropoutrates}, as a means to find a trend in these parameter settings and ultimately determine which settings would be optimal.
Now, to optimise the L1 and L2 regularization models even further we could optimise the regularisation parameter for the weights and bias separately. This could be done by iterating through various regularisation parameter settings, plotting their performances and choosing the parameter settings that produced the best performing model.
Lastly, to optimise the random seed we would iterate through various random seeds to find which produces the best starting point for the weights and Dropout layers. This optimisation matters in the context of gradient descent because we want to find the global minimum of the error function: gradient descent will always bring us to a local minimum, but not necessarily the global minimum, as it finds the minimum closest to the initialised weights. Which minimum our model converges to therefore depends on the initialised weights (and thus on the random seed), as these determine which minimum our model starts closest to
}
}
%% - - - - - - - - - - - - FIGURES - - - - - - - - - - - -
%% Question Figure 2:
\newcommand{\questionFigureTwo} {
\youranswer{%Question Figure 2 - Replace the images in Figure 2 with figures depicting the accuracy and error, training and validation curves for your experiments varying the number of hidden units.
%
\begin{figure}[t]
\centering
\begin{subfigure}{\linewidth}
\includegraphics[width=\linewidth]{figures/width-fig1.pdf}
\caption{accuracy by epoch}
\label{fig:width_acccurves}
\end{subfigure}
\begin{subfigure}{\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/width-fig2.pdf}
\caption{error by epoch}
\label{fig:width_errorcurves}
\end{subfigure}
\caption{Training and validation curves in terms of classification accuracy (a) and cross-entropy error (b) on the EMNIST dataset for different network widths.}
\label{fig:width}
\end{figure}
}
}
%% Question Figure 3:
\newcommand{\questionFigureThree} {
\youranswer{%Question Figure 3 - Replace these images with figures depicting the accuracy and error, training and validation curves for your experiments varying the number of hidden layers.
%
\begin{figure}[t]
\centering
\begin{subfigure}{\linewidth}
\includegraphics[width=\linewidth]{figures/depth-fig1.pdf}
\caption{accuracy by epoch}
\label{fig:depth_acccurves}
\end{subfigure}
\begin{subfigure}{\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/depth-fig2.pdf}
\caption{error by epoch}
\label{fig:depth_errorcurves}
\end{subfigure}
\caption{Training and validation curves in terms of classification accuracy (a) and cross-entropy error (b) on the EMNIST dataset for different network depths.}
\label{fig:depth}
\end{figure}
}
}
%% Question Figure 4:
\newcommand{\questionFigureFour} {
\youranswer{
%Question Figure 4 - Replace these images with figures depicting the Validation Accuracy and Generalisation Gap for each of your experiments varying the Dropout inclusion rate, L1/L2 weight penalty, and for the 8 combined experiments (you will have to find a way to best display this information in one subfigure).
%
\begin{figure*}[t]
\centering
\begin{subfigure}{.3\linewidth}
\includegraphics[width=\linewidth]{figures/dropout-val-acc-gen-gap-scaled.pdf}
\caption{Metrics by inclusion rate}
\label{fig:dropoutrates}
\end{subfigure}
\begin{subfigure}{.3\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/l1-l2-val-acc-gen-gap.pdf}
\caption{Metrics by weight penalty}
\label{fig:weightrates}
\end{subfigure}
\begin{subfigure}{.3\linewidth}
\centering
\includegraphics[width=1\linewidth]{figures/combined-val-acc-gen-gap-line.pdf}
\caption{Metrics by combining inclusion rate and weight penalty}
\label{fig:extra}
\end{subfigure}
\caption{Hyperparameter search results for each regularisation method and their combinations}
\label{fig:hp_search}
\end{figure*}
}
}
%% - - - - - - - - - - - - TABLES - - - - - - - - - - - -
%% Question Table 1:
\newcommand{\questionTableOne} {
\youranswer{
%Question Table 1 - Fill in Table 1 with the results from your experiments varying the number of hidden units.
%
\begin{table}[t]
\centering
\begin{tabular}{c|cc}
\toprule
\# hidden units & val. acc. & generalization gap \\
\midrule
32 & 77.94\% & 0.148 \\
64 & 80.91\% & 0.344 \\
128 & 80.92\% & 0.803 \\
\bottomrule
\end{tabular}
\caption{Validation accuracy (\%) and generalization gap (in terms of cross-entropy error) for varying network widths on the EMNIST dataset.}
\label{tab:width_exp}
\end{table}
}
}
%% Question Table 2:
\newcommand{\questionTableTwo} {
\youranswer{
%Question Table 2 - Fill in Table 2 with the results from your experiments varying the number of hidden layers.
%
\begin{table}[t]
\centering
\begin{tabular}{c|cc}
\toprule
\# hidden layers & val. acc. & generalization gap \\
\midrule
1 & 80.92\% & 0.803 \\
2 & 81.56\% & 1.456 \\
3 & 82.51\% & 1.538 \\
\bottomrule
\end{tabular}
\caption{Validation accuracy (\%) and generalization gap (in terms of cross-entropy error) for varying network depths on the EMNIST dataset.}
\label{tab:depth_exps}
\end{table}
}
}
%% Question Table 3:
\newcommand{\questionTableThree} {
\youranswer{
%Question Table 3 - Fill in Table 3 with the results from your experiments varying the hyperparameter values for each of L1 regularisation, L2 regularisation, and Dropout (use the values shown on the table) as well as the results for your experiments combining L1/L2 and Dropout (you will have to pick what combinations of hyperparameter values to test for the combined experiments; each of the combined experiments will need to use Dropout and either L1 or L2 regularisation; run an experiment for each of 8 different combinations). Use \textit{italics} to print the best result per criterion for each set of experiments, and \textbf{bold} for the overall best result per criterion.
%
\begin{table*}[t]
\centering
\begin{tabular}{c|c|cc}
\toprule
Model & Hyperparameter value(s) & Validation accuracy & Generalization gap \\
\midrule
\midrule
Baseline & - & 0.836 & 0.290 \\
\midrule
\multirow{3}*{Dropout}
& 0.7 & 0.817 & \emph{0.030} \\
& 0.9 & 0.858 & 0.095 \\
& 0.95 & \emph{0.862} & 0.142 \\
\midrule
\multirow{3}*{L1 penalty}
& 1e-4 & \emph{0.853} & 0.074 \\
& 1e-3 & 0.747 & 0.005 \\
& 1e-1 & 0.021 & \textbf{\emph{0}} \\
\midrule
\multirow{3}*{L2 penalty}
& 1e-4 & 0.845 & 0.234 \\
& 1e-3 & \emph{0.851} & 0.100 \\
& 1e-1 & 0.02 & \textbf{\emph{0}} \\
\midrule
\multirow{6}*{Combined}
& 0.95, L1 1e-4 & 0.854 & \emph{0.042} \\
& 0.95, L1 1e-5 & 0.864 & 0.121 \\
& 0.95, L1 1e-6 & 0.859 & 0.139 \\
& 0.95, L2 1e-4 & 0.864 & 0.115 \\
& 0.95, L2 1e-5 & \textbf{\emph{0.865}} & 0.138 \\
& 0.95, L2 1e-6 & 0.861 & 0.140 \\
\bottomrule
\end{tabular}
\caption{Results of all hyperparameter search experiments. \emph{Italics} indicate the best result per series and \textbf{bold} indicates the best overall result.}
\label{tab:hp_search}
\end{table*}
}
}
%% END of YOUR ANSWERS