# L2-Regularization

## Logistic Regression

<font color='red'>
**Note: Incomplete lecture. Linked are the lectures that need to be perused to complete the sections**

$\require{cancel}$
### Regularized Cost Function:

- $\textbf{w} \in \mathbb{R}^{n_x}$
- $b \in \mathbb{R}$

Adding a regularization term, weighted by the regularization parameter $\lambda$, to the cost function:

$$J(\textbf{w}, b) = \frac{1}{m} \sum^{m}_{i=1} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert \textbf{w} \rVert^{2}_{2} \quad + \underbrace{\cancel{\frac{\lambda}{2m} b^{2}}}_{omit}$$
where $\lVert \textbf{w} \rVert^{2}_{2} = \sum^{n_x}_{j=1} w^{2}_{j} = \textbf{w}^{T} \textbf{w}$ is the squared Euclidean norm (the squared L2-norm) of the parameter vector $\textbf{w}$.

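Because $\textbf{w}$ is a high-dimensional vector while $b$ is a single number, almost all of the parameters are in $\textbf{w}$. Regularizing $b$ makes little practical difference, so its term is usually omitted.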
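The regularized cost above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes $\mathcal{L}$ is the usual binary cross-entropy loss, that `X` has shape `(n_x, m)` with one example per column, and the function name `l2_regularized_cost` and the small constant `eps` are introduced here only for the example.

```python
import numpy as np

def l2_regularized_cost(w, b, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * ||w||_2^2; b is not penalized."""
    m = X.shape[1]                       # assumed layout: X is (n_x, m)
    z = w @ X + b                        # linear scores, shape (m,)
    y_hat = 1.0 / (1.0 + np.exp(-z))     # sigmoid predictions
    eps = 1e-12                          # assumption: guards against log(0)
    cross_entropy = -np.mean(
        y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)
    )
    l2_penalty = lam / (2 * m) * np.dot(w, w)   # w^T w is the squared L2-norm
    return cross_entropy + l2_penalty
```

With `lam = 0` this reduces to the unregularized cost, so the penalty can be inspected by comparing the two values.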

## Neural Networks

### Regularized Cost Function:

$$J(\textbf{w}^{[1]}, \textbf{b}^{[1]}, \ldots ,\textbf{w}^{[L]}, \textbf{b}^{[L]}) = \frac{1}{m} \sum^{m}_{i=1} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum^{L}_{l=1} \lVert \textbf{w}^{[l]} \rVert^{2}_{F} \quad + \underbrace{\cancel{\frac{\lambda}{2m} \sum^{L}_{l=1} \lVert \textbf{b}^{[l]} \rVert^{2}_{2}}}_{omit}$$

where $\lVert \textbf{w}^{[l]} \rVert^{2}_{F}$ is the squared Frobenius norm of the weight matrix of layer $l$:

$$\lVert \textbf{w}^{[l]} \rVert^{2}_{F} = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}}(w^{[l]}_{i,j})^2$$
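As a rough sketch, the penalty term of this cost could be computed by summing the squared entries of every layer's weight matrix. The function name `frobenius_penalty` and the convention of passing the weight matrices as a Python list are assumptions made for this example.

```python
import numpy as np

def frobenius_penalty(weights, lam, m):
    """(lam / 2m) * sum over layers of the squared Frobenius norm.

    weights: list of per-layer weight matrices (hypothetical container),
             each of shape (n[l], n[l-1]).
    """
    squared_norms = (np.sum(np.square(W)) for W in weights)
    return lam / (2 * m) * sum(squared_norms)
```

The result is added to the unregularized cost; the bias vectors are left out, matching the omitted term above.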

### Derivative Update:

$$dW^{[l]} = \text{backprop_term} + \frac{\lambda}{m}W^{[l]}$$

### Gradient Descent:

The weight update rule now becomes:  

$$W^{[l]} := W^{[l]} - \alpha \; dW^{[l]}$$

This can be rearranged to:

$$
\begin{align}
W^{[l]} :&= W^{[l]} - \alpha \; [\text{backprop_term} + \frac{\lambda}{m}W^{[l]}] \\
&= W^{[l]} - \frac{\alpha \; \lambda}{m} \; W^{[l]} - \alpha \; \text{backprop_term} \\
\end{align}
$$

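which shows that, on every step, $W^{[l]}$ is first multiplied by $(1 - \frac{\alpha \; \lambda}{m})$, a factor slightly smaller than 1, before the usual backprop update is subtracted. This is why L2-regularization is also called weight decay.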
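The equivalence of the two forms of the update can be checked numerically. In the sketch below, `backprop_term` stands for the unregularized gradient as in the equations above; the function names `l2_step` and `weight_decay_step` are made up for illustration.

```python
import numpy as np

def l2_step(W, backprop_term, alpha, lam, m):
    """Gradient step using the regularized gradient dW."""
    dW = backprop_term + (lam / m) * W
    return W - alpha * dW

def weight_decay_step(W, backprop_term, alpha, lam, m):
    """Same step, written as a decay of W followed by the plain update."""
    return (1 - alpha * lam / m) * W - alpha * backprop_term
```

Both functions should return the same matrix up to floating-point error, for example when compared with `np.allclose`.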

# Dropout Regularization

## Inverted Dropout

<font color='red'>
This section needs to be clarified and looked into deeper

[Link to lecture](https://www.coursera.org/learn/deep-neural-network/lecture/eM33A/dropout-regularization)

Considering a single layer in a network, with l=3:

Let `keep_prob` be the probability that a given hidden unit will be kept.  
Let `d3` be the dropout mask for the 3rd layer, described by:

```
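# Boolean mask: each unit is kept with probability keep_prob
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped units
a3 /= keep_prob           # rescale so the expected value of a3 is unchanged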
```
`a3` is scaled up by dividing it by `keep_prob`, so that its expected value stays the same as without dropout.

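Dropout should not be applied at test time. Note also that with dropout the cost function $J$ is no longer well-defined, because a different random set of units is dropped on every iteration; this makes it harder to check that $J$ decreases monotonically.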
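A self-contained sketch of inverted dropout for one layer's activations is given below. The helper name `inverted_dropout` and the use of a NumPy `Generator` passed in as `rng` are choices made for this example only; at test time the activations would simply be used unchanged.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to activations `a` (training time only)."""
    mask = rng.random(a.shape) < keep_prob   # True where the unit is kept
    a_dropped = a * mask / keep_prob         # drop units, then rescale
    return a_dropped, mask                   # mask is reused in backprop

rng = np.random.default_rng(0)
a3 = np.ones((50, 200))
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)
```

Because of the division by `keep_prob`, the mean of `a3_dropped` stays close to the mean of `a3`, which is the point of the "inverted" variant.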

## Other Regularization Methods

1. Increasing training set: Data Augmentation
2. Early stopping

## Normalization

Given an input $\textbf{x}$ with $2$ features, $x_1, x_2$:

Subtract the mean:

$$\mathbf{\mu} = \frac{1}{m} \sum^{m}_{i=1} x^{(i)}$$
$$\textbf{x} := \textbf{x} - \mathbf{\mu}$$ 

Normalize the variance:

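$$\mathbf{\sigma^{2}} = \frac{1}{m} \sum^{m}_{i=1} (x^{(i)})^{2}$$

where the square is taken element-wise on the mean-subtracted examples, so $\mathbf{\sigma^{2}}$ is a vector holding the variance of each feature. Finally, divide each example by the standard deviation $\mathbf{\sigma}$, feature by feature:

$$\textbf{x} \; /= \mathbf{\sigma}$$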

*Then use the same $\mu$ and $\sigma$, computed on the training set, to normalize the test set as well.*
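The normalization steps can be sketched as two small functions, one that estimates the statistics on the training set and one that applies them to any set. The names `fit_normalizer` and `normalize`, and the `(n_x, m)` layout with one example per column, are assumptions of this example. It does not handle features with zero variance.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation; X_train is (n_x, m)."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma2 = ((X_train - mu) ** 2).mean(axis=1, keepdims=True)
    return mu, np.sqrt(sigma2)

def normalize(X, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (X - mu) / sigma

X_train = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
X_test = np.array([[2.0], [20.0]])
mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)   # reuses the training statistics
```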

## Vanishing/Exploding Gradients

[Link to lecture](https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients)

Consider a very deep neural network with the same number of neurons in each hidden layer, 2 input features, and a linear activation function, i.e. $g(z) = z$. Ignoring the bias terms, the output $\hat{y}$ can then be written as

$$\hat{y} = W^{[L]}\;W^{[L-1]}\;W^{[L-2]} \ldots \underbrace{W^{[3]} \underbrace{W^{[2]} \underbrace{W^{[1]} \textbf{x}}_{\textbf{a}^{[1]} = g(\textbf{z}^{[1]}) = \textbf{z}^{[1]}}}_{\textbf{a}^{[2]} = g(\textbf{z}^{[2]})}}_{\ldots}$$ 
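To see why this product matters, the sketch below makes the simplifying assumption that every layer uses the same weight matrix, `scale` times the identity. The output is then `scale ** num_layers` times the input, which grows exponentially when `scale > 1` and shrinks exponentially when `scale < 1`. The function name `forward_linear` and the specific values are illustrative only.

```python
import numpy as np

def forward_linear(x, scale, num_layers):
    """Forward pass through num_layers linear layers with W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a            # g(z) = z, and the bias is ignored
    return a

x = np.ones((2, 1))
exploding = forward_linear(x, scale=1.5, num_layers=50)
vanishing = forward_linear(x, scale=0.5, num_layers=50)
```

The same exponential behaviour affects the gradients computed in backprop, which is what makes very deep networks hard to train without careful initialization.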

### Weight Initialization for Deep Networks

[Link to lecture](https://www.coursera.org/learn/deep-neural-network/lecture/RwqYe/weight-initialization-for-deep-networks)

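The following holds for a single neuron, assuming a ReLU activation function:

Set $\text{var}(w_i) = \frac{2}{n}$, where $n$ is the number of inputs feeding into the neuron ($n^{[l-1]}$ for a neuron in layer $l$):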

$W^{[l]} = \text{np.random.randn(<shape>) * np.sqrt}(\frac{2}{n^{[l-1]}})$

**Other variants**

Xavier Initialization: for `tanh` activations, scale by $\sqrt{\frac{1}{n^{[l-1]}}}$ instead.

*The variance term may also be a hyperparameter to tune*
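Both variants can be written as one small helper. This is a sketch under assumptions: the function name `init_weights`, the `activation` string argument, and the use of a NumPy `Generator` are not from the lecture, and only the two scalings mentioned above are covered.

```python
import numpy as np

def init_weights(n_prev, n_curr, activation="relu", rng=None):
    """Random weights of shape (n_curr, n_prev), scaled by fan-in.

    "relu" uses variance 2 / n_prev; any other value uses the
    Xavier scaling 1 / n_prev intended for tanh.
    """
    rng = np.random.default_rng() if rng is None else rng
    variance = 2.0 / n_prev if activation == "relu" else 1.0 / n_prev
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(variance)

W1 = init_weights(n_prev=500, n_curr=400, activation="relu")
```

Treating the numerator (2 or 1) as a tunable multiplier would be one way to expose the variance as a hyperparameter.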

## Gradient Checking

[Link to lecture](https://www.coursera.org/learn/deep-neural-network/lecture/htA0l/gradient-checking)

