# Regularizing Neural Networks

### L1 Regularization

L1 regularization, also known as Lasso regularization, is a technique for reducing overfitting in machine learning models by adding a penalty term to the cost function that encourages the model to have sparse weight vectors.

In L1 regularization, the penalty term is proportional to the sum of the absolute values of the weights. Specifically, the cost function is modified as follows:

**J(w) = J_0(w) + lambda * ||w||_1**

where J_0(w) is the original cost function, w is the weight vector, lambda is the regularization parameter (a hyperparameter that controls the strength of the regularization), and ||w||_1 is the L1 norm of the weight vector, defined as the sum of the absolute values of the elements of the vector:

**||w||_1 = sum_i(abs(w[i]))**

The effect of the L1 penalty term is to shrink the weights towards zero and encourage some of them to become exactly zero. This has the effect of reducing the complexity of the model and improving its generalization performance, particularly when the data is high-dimensional and sparse.

In other words, L1 regularization can be used to perform feature selection, by setting the weights of irrelevant features to zero. This can help to improve the interpretability of the model and reduce the risk of overfitting.

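To optimize the L1-regularized cost function, we can use an optimization algorithm such as gradient descent, with the addition of a regularization term in the update rule for the weights. Because the L1 norm is not differentiable at zero, this term is a subgradient (the sign of each weight) rather than an ordinary gradient. Plain subgradient steps shrink the weights towards zero but rarely land exactly on zero, so proximal (soft-thresholding) updates are typically used when exact sparsity matters.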
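As a minimal sketch of the penalty and one update step in NumPy: the function names `l1_penalty` and `l1_subgradient_step` are illustrative, and `grad_J0` (the gradient of the unregularized cost) is assumed to be supplied by the caller.

```python
import numpy as np

def l1_penalty(w, lam):
    """Return lambda * ||w||_1 for a weight vector w."""
    return lam * np.sum(np.abs(w))

def l1_subgradient_step(w, grad_J0, lam, lr):
    """One subgradient descent step on J(w) = J_0(w) + lambda * ||w||_1.

    np.sign(w) is a valid subgradient of ||w||_1 (it is 0 where w == 0).
    """
    return w - lr * (grad_J0 + lam * np.sign(w))

w = np.array([0.5, -2.0, 0.0])
penalty = l1_penalty(w, lam=0.1)  # 0.1 * (0.5 + 2.0 + 0.0)
# With a zero data gradient, only the penalty moves the weights.
w_next = l1_subgradient_step(w, grad_J0=np.zeros(3), lam=0.1, lr=1.0)
```

Every non-zero weight moves towards zero by the same amount `lr * lam`, regardless of its size, which is what eventually drives small weights to zero.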

Overall, L1 regularization is a useful technique for reducing overfitting and performing feature selection in machine learning models. It provides a way to encourage sparsity in the weight vectors, which can improve the interpretability and generalization performance of the model.

### L2 Regularization (aka Frobenius Norm, aka Weight Decay)

L2 regularization, also known as Ridge regularization, is a technique for reducing overfitting in machine learning models by adding a penalty term to the cost function that encourages the model to have small weight values.

In L2 regularization, the penalty term is proportional to the sum of the squared values of the weights. Specifically, the cost function is modified as follows:

**J(w) = J_0(w) + lambda/m * ||w||_2^2**

where J_0(w) is the original cost function, w is the weight vector, m is the number of training examples, lambda is the regularization parameter (a hyperparameter that controls the strength of the regularization), and ||w||_2^2 is the squared L2 norm of the weight vector, defined as the sum of the squared values of the elements of the vector:

**||w||_2^2 = sum_i(w[i]^2)**

The effect of the L2 penalty term is to shrink the weights towards zero, without necessarily setting any of them to exactly zero. This has the effect of reducing the complexity of the model and improving its generalization performance, particularly when the data is not sparse.

In other words, L2 regularization can be used to prevent overfitting by reducing the magnitude of the weights, without necessarily eliminating any of the features. This can help to improve the stability and robustness of the model.

To optimize the L2-regularized cost function, we can use an optimization algorithm such as gradient descent or L-BFGS, with the addition of a regularization term in the update rule for the weights.
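The update rule can be sketched as follows, using the lambda/m scaling from the cost function above. The function names are illustrative and `grad_J0` is assumed to come from backpropagation on the unregularized cost.

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Return (lambda / m) * ||w||_2^2."""
    return (lam / m) * np.sum(w ** 2)

def l2_gradient_step(w, grad_J0, lam, m, lr):
    """One gradient descent step on J(w) = J_0(w) + (lambda / m) * ||w||_2^2.

    The penalty's gradient is (2 * lambda / m) * w, so each step multiplies
    w by (1 - lr * 2 * lambda / m) before applying the data gradient.
    """
    return w - lr * (grad_J0 + (2.0 * lam / m) * w)

w = np.array([1.0, -2.0])
penalty = l2_penalty(w, lam=0.5, m=10)  # 0.05 * (1 + 4)
# With a zero data gradient, the weights simply decay by a factor of 0.9.
w_next = l2_gradient_step(w, grad_J0=np.zeros(2), lam=0.5, m=10, lr=1.0)
```

The multiplicative shrinkage on every step is why L2 regularization is also called weight decay.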

Overall, L2 regularization is a useful technique for reducing overfitting and improving the generalization performance of machine learning models. It provides a way to encourage small weight values, which can improve the stability and robustness of the model.

### Frobenius norm
The Frobenius norm of a matrix is a measure of the size or magnitude of the matrix. It is defined as the square root of the sum of the squared elements of the matrix:
**||A||_F = sqrt(sum_i(sum_j(A[i,j]^2)))**


where A is the matrix, i and j are the row and column indices, and ||A||_F is the Frobenius norm of A.

In other words, the Frobenius norm of a matrix is the square root of the sum of the squares of all the elements of the matrix. It can be interpreted as the Euclidean norm of the vector obtained by "flattening" the matrix into a column vector.
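The equivalence with the Euclidean norm of the flattened matrix can be checked directly in NumPy. This is a small illustration with an arbitrary example matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Definition: square root of the sum of squared elements.
fro_by_definition = np.sqrt(np.sum(A ** 2))

# NumPy's built-in Frobenius norm.
fro_numpy = np.linalg.norm(A, ord="fro")

# Euclidean norm of the matrix flattened into a vector.
fro_flattened = np.linalg.norm(A.ravel())
```

All three values equal sqrt(30), since 1 + 4 + 9 + 16 = 30.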

The Frobenius norm is often used as a regularization term in machine learning models, particularly in linear regression and matrix factorization. By adding the Frobenius norm of the weight matrix to the cost function, we encourage the model to have smaller weight values, which can help prevent overfitting and improve generalization performance.

Overall, the Frobenius norm provides a way to quantify the size or magnitude of a matrix, and is a useful tool in various applications, particularly in linear algebra and machine learning.

The Frobenius norm is useful in machine learning for several reasons:

- Regularization: The squared Frobenius norm of the weight matrix can be added to the cost function as a regularization term. This encourages the model to have smaller weight values, which reduces the effective complexity of the model and makes it less likely to fit the noise in the training data.
- Model Selection: The Frobenius norm can serve as one rough criterion for model selection. Given a set of candidate models, comparing the Frobenius norms of their weight matrices gives an indication of which model is simpler and less likely to overfit. The norm is only meaningful alongside validation performance, and it is not directly comparable between models with very different numbers of parameters.
- Matrix Analysis: The Frobenius norm can be used to analyze the properties of a matrix. For example, the Frobenius norm of the covariance matrix can be used as a measure of the total variability of the data. The Frobenius norm of the difference between two matrices can be used as a measure of their dissimilarity or distance.

Overall, the Frobenius norm is a useful tool in machine learning for regularization, model selection, and matrix analysis. It provides a way to quantify the size or magnitude of a matrix, which can help to improve the performance and interpretability of machine learning models.

## Why does Regularization help with Overfitting of Neural Networks?
One common intuition is that regularization helps with overfitting in neural networks because it effectively deactivates certain nodes within the network, reducing the network's size so that it prioritises the most important and effective nodes.

This intuition is only approximately right: regularization does not literally reduce the size of the network or select specific nodes. It can have a similar effect indirectly, by reducing the influence of less important features so that the network focuses on the most important ones. Additionally, some types of regularization, such as dropout, randomly remove nodes during training, which reduces the effective size of the network on each iteration.

If lambda is large, the penalty pushes the values of W towards something small, so the inputs to the activation function stay small and the activation (if tanh or sigmoid is used) operates in its roughly linear region. A smaller lambda has the same kind of effect, only weaker.

Effectively this makes every layer in the network close to being linear, and a stack of nearly linear layers cannot fit very complicated, noise-driven decision boundaries.
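A quick numerical illustration of the "linear region" argument, assuming standard normal inputs and a single scalar weight scale (the helper `max_gap_from_linear` is hypothetical and only measures how far tanh(z) is from z):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

def max_gap_from_linear(weight_scale):
    """Largest |tanh(z) - z| over the sample, where z = weight_scale * x."""
    z = weight_scale * x
    return np.max(np.abs(np.tanh(z) - z))

gap_small = max_gap_from_linear(0.01)  # small weights: tanh is almost the identity
gap_large = max_gap_from_linear(2.0)   # large weights: tanh saturates
```

With small weights the gap is negligible, so the layer behaves almost like a linear map; with large weights the activation saturates and the layer is strongly nonlinear.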

## Dropout Regularization

In dropout, you go through each layer in the network and assign it a probability with which individual nodes in that layer are removed. On every training iteration, nodes are dropped at random according to that probability, and the connections to and from the dropped nodes are removed before running that iteration's forward and backward pass.

### Inverted Dropout (most commonly used version of dropout)
Inverted dropout is a commonly used version of dropout that scales the values of the remaining units to maintain their expected value, even though some units have been dropped out. 

A different set of nodes are randomly dropped out during each iteration of the model's training. The dropout mask is generated anew at each iteration by sampling from a Bernoulli distribution with parameter equal to the keep probability.

**IMPORTANT** This process can include the 0th layer (the input features X), although the keep probability for the inputs is usually kept close to 1.0.

The keep probability determines the expected fraction of nodes that are kept, and the actual fraction of nodes kept varies stochastically across iterations due to the randomness introduced by dropout. This helps to prevent overfitting by adding noise to the activations of the neurons, which can reduce their sensitivity to spurious patterns in the training data.

During testing or prediction, however, dropout is not applied: all nodes are kept and no extra scaling is needed, because dividing by the keep probability during training already keeps the expected value of the activations the same in both settings. This ensures that the model's behavior is consistent between training and testing.

- d# = the dropout mask for a particular layer (d1, d2, etc.)
- keep_prob = the probability that you will keep a node
- d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob (np.random.rand samples uniformly from [0, 1), so each entry is True with probability keep_prob; np.random.randn samples from a normal distribution and would keep the wrong fraction of nodes)
- a3 = a3 * d3 (element-wise multiplication)
- a3 /= keep_prob (scale up the remaining values to compensate for the dropped neurons, so that the expected value of Z4 is not affected)
- **IMPORTANT** Don't apply dropout at test time, as it will just add random noise to your output; use it only during training
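The steps above can be sketched as a single function. This is a minimal illustration rather than a full training loop; the function name and the use of `np.random.default_rng` are choices made here, not part of the original notes.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to activations `a` during training.

    Returns the masked, rescaled activations and the boolean mask
    (the mask is needed again in backpropagation).
    """
    mask = rng.random(a.shape) < keep_prob  # True with probability keep_prob
    return (a * mask) / keep_prob, mask

rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))  # stand-in activations for layer 3
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)
```

Because the kept activations are divided by `keep_prob`, the mean of `a3_dropped` stays close to the mean of `a3`, even though roughly 20% of the entries are zero.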

We typically disable dropout during testing by keeping all the neurons and their connections intact.

However, disabling dropout can result in overly confident predictions, which may not accurately reflect the model's uncertainty. To address this issue, one can use an ensemble of predictions made with different dropout masks during testing. In this approach, we generate multiple predictions by applying dropout to the network with different masks, and then combine these predictions to obtain an ensemble prediction. By using multiple dropout masks, we can capture the model's uncertainty and get a more accurate estimate of the prediction's reliability.
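A rough sketch of this ensemble idea for a tiny two-layer network is shown below. The network, its random weights, and the function `predict_with_dropout` are all made up for illustration; in practice the weights would come from a trained model.

```python
import numpy as np

def predict_with_dropout(x, W1, W2, keep_prob, rng):
    """One stochastic forward pass with dropout left on in the hidden layer."""
    h = np.maximum(0.0, W1 @ x)              # ReLU hidden layer
    mask = rng.random(h.shape) < keep_prob   # fresh mask on every call
    h = h * mask / keep_prob                 # inverted dropout scaling
    return W2 @ h

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 4))  # placeholder weights, not trained
W2 = rng.standard_normal((1, 16))
x = rng.standard_normal((4, 1))

# Repeat the prediction with different masks and summarize the results.
samples = np.stack([predict_with_dropout(x, W1, W2, 0.8, rng) for _ in range(100)])
mean_prediction = samples.mean(axis=0)
uncertainty = samples.std(axis=0)
```

The spread of the sampled predictions gives a rough indication of how sensitive the output is to which nodes are present.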

### What is Dropout really doing?
Intuition: Can't rely on any one feature, so the network has to spread out weights.

 

### Advanced Dropout Techniques
You can select different keep_prob values for each layer in the network. For layers where you are not concerned with overfitting, you can set the keep_prob to 1.0. Likewise, for layers where overfitting is a concern (layers with more nodes or which occur earlier in the network), smaller keep_prob values such as 0.5 or 0.7 can be used.

In general, larger keep_prob values (closer to 1.0) are often used for layers with fewer units, which are less prone to overfitting, while smaller keep_prob values (around 0.5) are often used for layers with more units, which are more prone to overfitting. Values far below 0.5 remove most of the layer on every iteration and are uncommon. The optimal keep_prob values for a given neural network and training data may need to be determined empirically through experimentation.

## Spatial Dropout
If you are applying dropout to a convolutional neural network (CNN) or a similar type of network that operates on image data, you may want to consider using spatial dropout instead of regular dropout. Spatial dropout randomly drops entire channels of feature maps during training, which can be more effective for preserving the spatial structure of the image data. In this case, the keep_prob value would be applied to the fraction of channels that are retained at each iteration.
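A minimal NumPy sketch of the idea, assuming feature maps are stored as (batch, channels, height, width); deep learning frameworks provide their own versions, and the function name here is illustrative:

```python
import numpy as np

def spatial_dropout(feature_maps, keep_prob, rng):
    """Drop whole channels of a (batch, channels, height, width) array."""
    batch, channels = feature_maps.shape[:2]
    # One keep/drop decision per (example, channel), broadcast over height and width.
    mask = rng.random((batch, channels, 1, 1)) < keep_prob
    return feature_maps * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((2, 8, 5, 5))
out = spatial_dropout(x, keep_prob=0.5, rng=rng)
```

Each channel of `out` is either entirely zero or entirely rescaled, so the spatial pattern within a surviving feature map is left intact.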

## Problems with Dropout in relation to J (the cost function)
There are two main problems that can arise when using dropout regularization in neural networks, both of which are related to the cost function J:

- Biased cost function: When using dropout, the cost function J that is minimized during training is a noisy estimate of the true cost function, since different subsets of neurons are dropped out at each iteration. J is no longer guaranteed to decrease on every iteration, which makes it harder to debug training by plotting J, and the noise can push the optimization towards a suboptimal solution. A common workaround for debugging is to set keep_prob to 1.0, check that J decreases steadily, and then turn dropout back on.
- Co-adaptation of neurons: Dropout is meant to discourage neurons from co-adapting, that is, from relying on specific other neurons to compensate for their mistakes. It reduces co-adaptation but does not remove it entirely, and the full network used at test time only approximates the average of the thinned networks seen during training, so a gap between training and test behavior can remain.

To avoid these problems, several techniques have been proposed to modify the way dropout is applied during training. Here are a few examples:

- Weight scaling: One common technique to mitigate the biased cost function is to rescale during training by the inverse of the keep probability (i.e., multiply the kept activations by 1/keep_prob, as in inverted dropout), so that the expected value of the activations is unchanged regardless of the dropout rate. This keeps training-time and test-time activations on the same scale and improves optimization.
- Dropout with annealing: Another technique to address the biased cost function is to gradually reduce the dropout rate during training. This can allow the model to converge to a more optimal solution by gradually reducing the amount of noise in the cost function.
- DropConnect: DropConnect is a variant of dropout that randomly drops out entire connections (i.e., weights) between neurons, rather than entire neurons themselves. This can help to prevent co-adaptation of neurons and improve generalization performance.
- Maxout networks: Maxout networks are a type of neural network that use the maxout activation function, in which each unit outputs the maximum over a group of linear functions of its input. Maxout networks were designed to work well with dropout, and can help to prevent co-adaptation of neurons.

These techniques are not mutually exclusive and can be combined in various ways to further improve performance. The choice of which technique to use may depend on the specific neural network architecture, the training data, and the regularization strategy being used.

## Data Augmentation as a Regularization method
Data augmentation is a technique commonly used in machine learning to increase the effective size of the training set by generating additional training examples from the existing data. By applying transformations to the images, we can create new images that are still representative of the underlying distribution of the data.

- **IMPORTANT** If you are having trouble training an image classification model and are unable to acquire a larger training set, you can double the size of your training set by mirroring the images horizontally and adding the mirrored images to it. This is not as good as acquiring twice as many unique images, but can be used to augment the training set.
- Furthermore, you can rotate the image and zoom in to create a new version of the image in a different orientation.
- By doing this you are effectively telling the network that a flipped cat is still a cat and a rotated cat is still a cat (but it is probably not wise to rotate by 90 degrees or more).
- If you wanted to augment the data in a handwriting detector, in contrast, you cannot flip the images (a mirrored character is usually no longer a valid character), but you can rotate them slightly and distort them to achieve the same basic result.
- **IMPORTANT** It is important to note that data augmentation should be used judiciously and with an understanding of the underlying data distribution. Applying too many transformations may result in a training set that no longer accurately represents the true distribution of the data, leading to poor generalization performance.
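The horizontal mirroring described in the list above can be sketched in NumPy as follows, assuming images are stored as (num_images, height, width, channels). The function name is illustrative, and the tiny array stands in for real image data.

```python
import numpy as np

def augment_with_horizontal_flip(images, labels):
    """Return the original images plus mirrored copies, with matching labels."""
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

images = np.arange(2 * 3 * 4 * 1, dtype=float).reshape(2, 3, 4, 1)
labels = np.array([0, 1])
aug_images, aug_labels = augment_with_horizontal_flip(images, labels)
```

The labels are duplicated alongside the images, since a mirrored example keeps the class of its original.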

## Early Stopping
Early stopping is a regularization technique in deep learning that aims to prevent overfitting by stopping the training process before the model starts to overfit the training data. The basic idea is to monitor the performance of the model on a validation set during training and stop the training process when the validation error stops improving or starts to increase.

Here's a high-level overview of how early stopping works:

- Split the available data into three sets: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used to monitor the model's performance during training, and the test set is used to evaluate the final performance of the trained model.
- Train the model on the training set and monitor its performance on the validation set after each epoch (or batch) of training.
- Keep track of the best validation error seen so far. If the validation error starts to increase, stop the training process and return the model that achieved the best validation error.
- Evaluate the final performance of the trained model on the test set to estimate the model's generalization performance.
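The stopping rule in the steps above can be sketched as a small function over a list of per-epoch validation errors. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added here to avoid stopping on a single noisy epoch; it is not part of the steps above.

```python
def early_stopping_epoch(val_errors, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops once `patience` epochs have passed without a new best error.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

val_errors = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8]
best_epoch, stop_epoch = early_stopping_epoch(val_errors, patience=2)
# best_epoch == 2 (error 0.6), stop_epoch == 4
```

In a real training loop the model parameters from `best_epoch` would be saved and restored when training stops.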

Early stopping can help to prevent overfitting by stopping the training process before the model starts to memorize the training data. By monitoring the performance of the model on the validation set, we can get an estimate of the model's ability to generalize to new, unseen data. If the validation error stops improving or starts to increase, it indicates that the model has started to overfit the training data and is no longer improving its ability to generalize to new data.

One potential drawback of early stopping is that it can lead to suboptimal solutions if the training process is stopped too early, before the model has had a chance to converge to the best possible solution. To mitigate this, it's important to monitor the validation error carefully and wait until it has stabilized before stopping the training process. Additionally, it's important to use a separate test set to evaluate the final performance of the trained model and ensure that it generalizes well to new data.

The use of early stopping couples the process of optimizing your values of W and b for the cost function with the process of reducing variance and overfitting. Typically, you want to first focus on minimizing the cost function J() and only then address overfitting with a separate tool, but early stopping makes this impossible because a single technique both cuts the optimization of J() short and acts as the regularizer.

## Normalizing Inputs
Normalization of inputs is a common preprocessing step in deep learning that involves scaling the input features to have zero mean and unit variance. Normalization can help to improve the convergence of the optimization algorithm, prevent the saturation of nonlinear activation functions, and reduce the impact of differences in scale between the features.

- Compute the mean and standard deviation of each input feature across the training set.
- Subtract the mean of each feature (computed across the training set) from every value of that feature in X, to center the data around 0.
- Divide each feature by its standard deviation to scale the data to have unit variance.
- Optionally, apply additional transformations to the data, such as clipping or rescaling, to ensure that the normalized features fall within a suitable range for the activation function.

Once the data has been centered, normalize the variance of each feature of X using sigma (the standard deviation, not the sigmoid function) like this:
- sigma^2 = **1/m * np.sum(X^2)** (X^2 is the element-wise square, summed over the m training examples for each feature)
- X /= sigma

**IMPORTANT** Ensure that you use the same mean and standard deviation, computed on the training set, to normalize all data, including the training, dev, and test sets.
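A minimal sketch of this procedure, assuming X is laid out as (n_features, m) with one training example per column. The function names and the small `eps` guard against division by zero are illustrative additions.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Center and scale X using statistics computed elsewhere."""
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, size=(2, 1000))
X_test = rng.normal(5.0, 3.0, size=(2, 200))

mu, sigma = fit_normalizer(X_train)          # statistics from training data only
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)   # reuse the training statistics
```

The test set is transformed with the training mean and standard deviation, so its own mean and variance will be close to, but not exactly, 0 and 1.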

**IMPORTANT** Normalization rarely does any harm to the performance of the network and therefore should be applied to pretty much every application.

## Vanishing and Exploding Gradients
Vanishing and exploding gradients are common issues that can occur during the training of deep neural networks, particularly in recurrent neural networks (RNNs) and deep feedforward networks with many layers. These issues can make it difficult or impossible for the model to learn from the training data, leading to poor performance or convergence to suboptimal solutions.

Vanishing gradients occur when the gradients (i.e., derivatives) of the loss function with respect to the weights of the network become very small as they propagate backward through the layers. This can happen when the activation function of the neurons is such that the derivative is close to zero over most of the input range, such as the sigmoid function. When the gradients become very small, the weights are updated very slowly, and the lower layers of the network may not learn anything useful.

Conversely, exploding gradients occur when the gradients become very large and cause the weights to update too quickly, leading to instability and poor convergence. This can happen when the weights are initialized to large values, or when the learning rate is too high.

Both vanishing and exploding gradients can be problematic for deep learning. Vanishing gradients can make it difficult for deep neural networks to learn long-term dependencies, since information from earlier layers gets diluted and lost as it propagates through the network. Exploding gradients can cause the optimization algorithm to oscillate or diverge, preventing the network from converging to a good solution.

Several techniques have been developed to mitigate the vanishing and exploding gradient problems, including:

- Weight initialization: Initializing the weights using carefully chosen methods, such as Glorot initialization or He initialization, can help to mitigate the vanishing and exploding gradient problems.
- Activation functions: Using activation functions that have derivatives that are not close to zero over most of the input range, such as the Rectified Linear Unit (ReLU), can help to mitigate the vanishing gradient problem.
- Gradient clipping: Limiting the size of the gradients during training, either by rescaling the gradients or clipping them to a fixed range, can help to mitigate the exploding gradient problem.
- Long short-term memory (LSTM) and gated recurrent unit (GRU) cells: Using specialized RNN cells, such as LSTMs and GRUs, that are specifically designed to handle long-term dependencies can help to mitigate the vanishing gradient problem in RNNs.

These techniques can be used in combination to improve the training stability and convergence of deep neural networks.
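As one concrete example from the list above, gradient clipping by rescaling can be sketched as follows. The function name and the choice to clip by the global norm across all gradient arrays are assumptions made for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
# norm_before is 5.0; the clipped gradients keep their direction but have norm 1.0
```

Rescaling preserves the direction of the update, whereas clipping each element to a fixed range can change it.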

## Weight Initialization for Deep Networks
The more inputs a neuron has, the smaller its initialized W values should be. As a general rule, the variance of the W values feeding a layer should be around 1/n, where n is the number of inputs to that layer. So if each node in a layer has four inputs, the variance of its W values should be around 0.25, which corresponds to a standard deviation of 0.5.

This helps to ensure that the input to each neuron in the current layer has roughly the same scale as the input to neurons in the previous layer. If the variance of the weight values is too high or too low, the activations in the network may become too large or too small, which can lead to numerical instability or saturation of the activation function.

For example, if the four layers of weights in a network receive 100, 50, 25, and 10 inputs, respectively, the variance of the weight values in each layer should be roughly 1/100, 1/50, 1/25, and 1/10, respectively. To achieve this, the standard deviation of the initial weight values can be set to sqrt(1/n), which gives approximately 0.1 for the first layer, 0.14 for the second layer, 0.2 for the third layer, and 0.32 for the fourth layer.

It's worth noting that these are just rules of thumb, and the optimal weight initialization scheme may depend on the specific architecture and task at hand. More advanced initialization methods, such as Glorot initialization or He initialization, may be more appropriate for certain types of networks and activation functions.

This basically equates to this code, where n[l] is the number of units in layer l:

**W[l] = np.random.randn(n[l], n[l - 1]) * np.sqrt(1/n[l - 1])**

When using a ReLU activation function this changes slightly to become (He initialization):

**W[l] = np.random.randn(n[l], n[l - 1]) * np.sqrt(2/n[l - 1])**

Another version, Xavier initialization, is intended for tanh activations and scales the weights by the square root of 1/n[l-1] (the first formula above).

Yet another approach scales by the square root of 2/(n[l-1] + n[l]).

Careful initialization does not solve the vanishing and exploding gradients problem, but it does help to combat it.
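The three scaling rules in this section can be collected into one helper. This is a sketch only: the function name and the scheme labels `"fan_in"`, `"he"`, and `"fan_avg"` are invented here for illustration.

```python
import numpy as np

def initialize_weights(layer_sizes, scheme, rng):
    """Return a list of weight matrices W[l] with shape (n[l], n[l-1]).

    layer_sizes is [n[0], n[1], ..., n[L]].
    """
    weights = []
    for l in range(1, len(layer_sizes)):
        n_in, n_out = layer_sizes[l - 1], layer_sizes[l]
        if scheme == "fan_in":       # sqrt(1 / n[l-1]), suited to tanh
            std = np.sqrt(1.0 / n_in)
        elif scheme == "he":         # sqrt(2 / n[l-1]), suited to ReLU
            std = np.sqrt(2.0 / n_in)
        elif scheme == "fan_avg":    # sqrt(2 / (n[l-1] + n[l]))
            std = np.sqrt(2.0 / (n_in + n_out))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        weights.append(rng.standard_normal((n_out, n_in)) * std)
    return weights

rng = np.random.default_rng(0)
weights = initialize_weights([1000, 500, 250], scheme="he", rng=rng)
```

Biases are not shown; they are commonly initialized to zero.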

## Numerical Approximation of Gradients
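It is much more accurate to approximate a gradient numerically by both adding and subtracting a small value epsilon from the parameter, instead of just adding to it. This takes a two-sided difference, (f(x + epsilon) - f(x - epsilon)) / (2 * epsilon), instead of a one-sided difference, (f(x + epsilon) - f(x)) / epsilon. The error of the two-sided estimate shrinks in proportion to epsilon^2, while the error of the one-sided estimate shrinks only in proportion to epsilon.

The downside of this method is that it needs two evaluations of the function per estimate, so it runs about twice as slowly, but this trade-off is usually worth it due to the increase in accuracy when calculating the gradient.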
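A small worked example makes the accuracy difference concrete. It uses f(x) = x^3 at x = 1, where the true derivative is 3; the function names are illustrative.

```python
def one_sided(f, x, eps):
    """Forward difference approximation of f'(x)."""
    return (f(x + eps) - f(x)) / eps

def two_sided(f, x, eps):
    """Centered difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: x ** 3
x, eps = 1.0, 0.01
true_derivative = 3.0

err_one = abs(one_sided(f, x, eps) - true_derivative)  # about 0.0301
err_two = abs(two_sided(f, x, eps) - true_derivative)  # about 0.0001
```

With epsilon = 0.01 the one-sided error is on the order of epsilon, while the two-sided error is on the order of epsilon squared.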


## Gradient Checking
Gradient checking is a technique used to verify the correctness of the gradients computed during backpropagation in deep neural networks. The basic idea is to compare the analytical gradients computed by the backpropagation algorithm with the numerical gradients computed using a finite difference approximation. If the difference between the two is small, it indicates that the backpropagation algorithm is working correctly, while a large difference may indicate a bug or numerical instability.

Here are the steps involved in using gradient checking to debug a network:

- Implement the forward propagation and backpropagation algorithms for the network.
- Choose a small subset of training examples and compute the gradients for each example using the backpropagation algorithm.
- For each weight parameter in the network, compute the numerical gradient by perturbing the weight value by a small amount epsilon (e.g., 1e-5) in each direction, computing the cost J using forward propagation for both perturbed values, and dividing the difference between the two costs by 2 * epsilon.
- Compare the analytical and numerical gradients using their relative difference, i.e. the norm of their difference divided by the sum of their norms. If the relative difference is small (e.g., less than 1e-7), the backpropagation algorithm is likely working correctly. If it is large (e.g., greater than 1e-4), there may be a bug or numerical instability in the implementation.
- If the difference between the analytical and numerical gradients is large, check for bugs in the forward or backpropagation algorithm, numerical stability issues (e.g., overflows or underflows), or problems with the weight initialization or regularization techniques.

Gradient checking can be computationally expensive and is typically used only for debugging purposes or to validate the correctness of a new implementation. It's important to note that gradient checking does not guarantee that the network will learn the correct function, but rather only verifies that the backpropagation algorithm is implemented correctly.

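Some important points about gradient checking:
- Don't use it in training, only to debug
- If the algorithm fails the gradient check, look at the individual components to try to identify the bug (compare individual db and dW values)
- Remember regularization: if the cost includes a regularization term, the gradients must include it too
- Does NOT work with dropout, because the random masks make J noisy; run the check with keep_prob set to 1.0
- Run at random initialization, and perhaps again after some training.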
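The procedure can be sketched for a generic cost function over a flat parameter vector. This assumes the parameters have already been flattened into a 1-D float array `theta`; `cost_fn` and `grad_fn` stand in for forward propagation and backpropagation, and the quadratic cost below is only a toy example.

```python
import numpy as np

def gradient_check(cost_fn, grad_fn, theta, eps=1e-5):
    """Return the relative difference between analytic and numerical gradients."""
    grad_analytic = grad_fn(theta)
    grad_numeric = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        # Two-sided difference for parameter i.
        grad_numeric[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    numerator = np.linalg.norm(grad_analytic - grad_numeric)
    denominator = np.linalg.norm(grad_analytic) + np.linalg.norm(grad_numeric)
    return numerator / denominator

cost = lambda t: 0.5 * np.sum(t ** 2)  # toy cost with known gradient t
correct_grad = lambda t: t
buggy_grad = lambda t: 2.0 * t         # deliberately wrong by a factor of 2

theta = np.array([1.0, -2.0, 3.0])
diff_good = gradient_check(cost, correct_grad, theta)
diff_bad = gradient_check(cost, buggy_grad, theta)
```

The correct gradient gives a relative difference far below 1e-7, while the buggy one gives a large value, which is the signal to start comparing individual components. The sketch does not handle the case where both gradients are exactly zero.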