# Week 1: Practical Aspects of Deep Learning

### Train/Dev/Test sets

![Train/Dev/Test sets](images/train_dev_test.png)

A few rules of thumb:
- Make sure the dev and test sets come from the same distribution as your training set.
- Not having a test set **might** be okay (dev set only). Although this is not recommended, you can skip the test set if you do not need an unbiased estimate of the final performance. However, doing so risks overfitting to the dev set with no way to detect it.

### Bias and variance

![Bias and Variance](images/bias_variance1.png)

How to evaluate bias and variance of a model?

| Error | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
| ----- | ---------- | ---------- | ---------- | ---------- |
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1.0% |
| Evaluation | High variance | High bias | High bias & high variance | Low bias & low variance |

The evaluations above assume an **optimal (Bayes) error of $\sim$ 0%**. They can change if the Bayes error is higher: for example, with a Bayes error of about 15%, Scenario 2 would count as low bias and low variance.
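
The reasoning behind the table can be sketched as a small helper. This is only an illustration: the function name `diagnose` and the 5% `gap_threshold` are assumptions made for this example, not values from the course, and what counts as a "large" gap depends on the problem.

```python
def diagnose(train_error, dev_error, bayes_error=0.0, gap_threshold=0.05):
    """Label a model from its error rates, given as fractions (0.15 means 15%).

    gap_threshold is a hypothetical cut-off chosen for this sketch.
    """
    avoidable_bias = train_error - bayes_error  # distance from the best achievable error
    variance = dev_error - train_error          # extra error on data the model has not seen
    high_bias = avoidable_bias > gap_threshold
    high_variance = variance > gap_threshold
    if high_bias and high_variance:
        return "high bias & high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias & low variance"


print(diagnose(0.01, 0.11))                    # Scenario 1: high variance
print(diagnose(0.15, 0.16))                    # Scenario 2: high bias
print(diagnose(0.15, 0.16, bayes_error=0.15))  # same errors, higher Bayes error
```

The last call shows how a higher Bayes error changes the verdict for the same train and dev errors.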

### Basic Recipe for Machine Learning

![Recipe for ML](images/recipe_for_ml.png)

Previously, the big challenge was the trade-off between bias and variance: lowering bias tended to raise variance, and vice versa. In modern ML this trade-off is far less of a constraint, because with more data, better algorithms, and more computing power we can often reduce one without noticeably increasing the other. That's why deep learning is working well for supervised learning.

### Regularization

To reduce overfitting or variance, one way is to train the model on more data. However, access to more data is often expensive. Regularization can help here.

#### Logistic Regression Regularization

![Logistic regression regularization](images/lreg_reg.png)

Note that the L2 norm is the most common choice for regularization. Some practitioners use the L1 norm instead because it tends to produce a sparse weight vector (many weights exactly zero), which can reduce storage and computation. In practice the improvement is small, and the L2 norm is recommended.
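
A minimal NumPy sketch of the L2-regularized cost, assuming the usual formulation $J(w, b) = \frac{1}{m}\sum_{i} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$ with the examples stored as columns of `X`. The function name and the small `eps` guard against `log(0)` are choices made for this example.

```python
import numpy as np

def logistic_cost_l2(w, b, X, y, lam):
    """Cross-entropy cost plus an L2 penalty on w.

    Shapes assumed here: X (n_features, m), y (1, m), w (n_features, 1), b scalar.
    """
    m = X.shape[1]
    z = w.T @ X + b
    a = 1.0 / (1.0 + np.exp(-z))            # sigmoid activations, shape (1, m)
    eps = 1e-12                             # avoids log(0); an implementation detail
    cross_entropy = -np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))
    l2_penalty = (lam / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + l2_penalty
```

Only `w` is penalized here; `b` is a single number, so regularizing it would make little difference.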

#### Neural Network L2 (Frobenius) Regularization

![Neural Network regularization](images/nn_reg.png)

**Correction:**

In the formulation of the Frobenius norm above, the correct formula is the following:

$$
\|w^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{i,j}^{[l]})^2
$$

* The summation over $i$ should run from 1 to $n^{[l]}$.
* The summation over $j$ should run from 1 to $n^{[l-1]}$.
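
In code, the squared Frobenius norm is just the sum of the squared entries of each weight matrix. The sketch below assumes the penalty $\frac{\lambda}{2m}\sum_l \|w^{[l]}\|_F^2$ is added to the cost; the function names are made up for this example.

```python
import numpy as np

def frobenius_penalty(weights, lam, m):
    """(lam / 2m) times the sum of squared Frobenius norms of all weight matrices."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def l2_gradient_step(W, dW_backprop, lam, m, learning_rate):
    """One gradient-descent update for a single layer.

    The penalty contributes (lam / m) * W to the gradient, so every step also
    shrinks W slightly, which is why L2 regularization is called "weight decay".
    """
    dW = dW_backprop + (lam / m) * W
    return W - learning_rate * dW
```

With a zero backprop gradient, `l2_gradient_step` simply multiplies `W` by $1 - \alpha\lambda/m$, where $\alpha$ is the learning rate.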

### Why does regularization prevent overfitting (variance)?

![NN regularization intuition1](images/nn_reg_intuition1.png)

![NN regularization intuition2](images/nn_reg_intuition2.png)

Regularization also pushes the parameters toward very small values. This reduces the influence of some hidden units (in the extreme, nearly eliminating them), which leaves an effectively simpler network that overfits less. In addition, smaller parameters drive down the $Z$ values. The $\tanh$ activation function is approximately linear for small $Z$, so each hidden layer behaves almost linearly, and a stack of linear layers is itself a linear function. The whole NN therefore stays close to a linear function, which reduces overfitting.
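
The claim that $\tanh$ is nearly linear for small $Z$ is easy to check numerically. The sample values below are arbitrary choices for this illustration.

```python
import numpy as np

z_small = np.array([-0.1, -0.01, 0.01, 0.1])
z_large = np.array([-3.0, 3.0])

# Largest gap between tanh(z) and the straight line y = z on each set of points.
small_err = np.max(np.abs(np.tanh(z_small) - z_small))
large_err = np.max(np.abs(np.tanh(z_large) - z_large))

print(small_err)  # tiny: tanh(z) is almost exactly z near zero
print(large_err)  # large: tanh saturates near +/-1 away from zero
```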

### Dropout Regularization

The core idea of dropout is to randomly "drop out" (set to zero) a certain percentage of neurons in a layer during each training iteration. This means that for each training example, the network has a slightly different architecture because a different set of neurons is temporarily removed.

1.  **During Training:**
    * A dropout rate, typically between 20% and 50%, is chosen. This is the probability that a neuron's output will be set to zero.
    * For each training batch, a new, random set of neurons is "dropped out" from the specified layers. This effectively creates a "thinned" version of the network for that particular training step.
    * The remaining, active neurons are scaled up by a factor of $1/(1 - \text{dropout rate})$. This is known as **inverted dropout** and ensures the expected output of the layer remains consistent, which is crucial for a stable learning process.

2.  **During Inference (Testing):**
    * Dropout is **not** applied during testing. All neurons are active and used for making predictions.
    * Because the neuron activations were scaled during training, no additional scaling is needed during inference. The network's output is already on the correct scale, which makes prediction a simpler and faster process.

| | **Standard Dropout** | **Inverted Dropout** |
| :--- | :--- | :--- |
| **Training** | Drop out neurons with probability **$p$**. | Drop out neurons with probability **$p$**, and scale remaining activations by multiplying by **$1/(1-p)$**. |
| **Inference** | Use all neurons, and scale their outputs by multiplying by **$(1-p)$**. | Use all neurons, with **no scaling** required. |
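
The inverted-dropout column of the table can be sketched for a single layer as follows. This is a minimal NumPy illustration, not a framework implementation; the function name, the `training` flag, and passing the random generator explicitly are assumptions made for this example.

```python
import numpy as np

def inverted_dropout(a, drop_rate, rng, training=True):
    """Apply inverted dropout to the activations `a` of one layer.

    Returns the activations and the mask (which backpropagation would reuse).
    At inference time the activations are returned unchanged and the mask is None.
    """
    if not training or drop_rate == 0.0:
        return a, None                      # inference: all neurons active, no scaling
    keep_prob = 1.0 - drop_rate
    mask = rng.random(a.shape) < keep_prob  # True for the neurons that are kept
    return a * mask / keep_prob, mask       # scale the survivors by 1 / (1 - p)


rng = np.random.default_rng(0)
a = np.ones((1000, 100))
a_train, mask = inverted_dropout(a, drop_rate=0.2, rng=rng)
print(a_train.mean())  # stays close to 1.0, the mean of the original activations
```

Because of the $1/(1-p)$ scaling, the mean activation is preserved on average, which is why no correction is needed at inference time.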

---

This video provides an explanation of dropout, a key regularization technique in deep learning:

<u>[Dropout: A Method to Regularize the Training of Deep Neural Networks](https://www.youtube.com/watch?v=hbTONDwwPUk)</u>

#### Why It Helps

Dropout works by preventing neurons from becoming too specialized and co-dependent on each other. By forcing neurons to learn to function in a wide variety of different network configurations, it encourages the network to learn more robust and generalizable features. This is analogous to having multiple separate, smaller neural networks being trained simultaneously and then averaged together, a technique known as **ensemble learning**. The result is a single, more resilient model that is less likely to overfit the training data.

#### Bonus

This video provides a theoretical and practical explanation of dropout regularization, including an implementation using TensorFlow and Keras.

<u>[What is Dropout Regularization?](https://www.youtube.com/watch?v=lcI8ukTUEbo)</u>

---

![dropout regularization](images/dropout_reg.png)

---

![dropout regularization](images/inverted_dropout_reg.png)

### Understanding dropout

#### Why does dropout work?

* **Prevents Co-dependence:** Dropout randomly "knocks out" neurons during each training iteration. This forces the remaining neurons to learn more robust and independent features because they cannot rely on any single neuron, as that neuron might be dropped out.
* **Shrinks Weights:** By forcing the network to spread out the "responsibility" for learning across more neurons, dropout has a similar effect to L2 regularization, which shrinks the weights and helps prevent overfitting.

#### Implementation Details

* **Varying Dropout Rates:** You can set different "keep probabilities" (the probability a neuron is kept) for different layers.
    * Layers with a large number of parameters are more prone to overfitting, so you can apply a lower keep probability (e.g., 0.5) to these layers for stronger regularization.
    * Layers that are less likely to overfit can have a higher keep probability (e.g., 0.7) or even 1.0 (no dropout).
    * The input layer typically has a keep probability close to 1.0.

#### Practical Considerations

* **When to Use It:** Dropout is a regularization technique. It should only be used when a model is overfitting. Computer vision models, which often have a massive number of input features (pixels), frequently use dropout because they are prone to overfitting due to limited data.
* **Debugging:** Dropout makes the cost function ($J$) less well-defined because the network architecture changes with each iteration. This makes it difficult to use the cost function to debug your code. A common practice is to first run the model with dropout turned off (keep probability = 1.0) to ensure the cost function is decreasing, and then re-enable dropout.

---

![why does dropout work](images/why_dropout_works.png)

### Other Regularization Methods

#### Data Augmentation
* **The Concept:** When you don't have enough training data to prevent overfitting, you can create new, synthetic data by applying transformations to your existing training examples.
* **How It Works:** For images, you can create new examples by flipping images horizontally, taking random crops, or applying slight distortions. For other data like optical character recognition, you can apply rotations and distortions to digits.
* **Benefits:** This method is a very inexpensive way to increase the size of your training set and effectively "regularize" your model, telling it that these transformations don't change the underlying subject (e.g., a flipped cat is still a cat).
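
A rough sketch of two of the image transformations mentioned above (horizontal flip and random crop), using plain NumPy. The function name, the 50% flip probability, and the `(height, width, channels)` layout are assumptions made for this example; real pipelines usually rely on a library for this.

```python
import numpy as np

def augment(image, rng, crop_size):
    """Return a randomly flipped and randomly cropped copy of `image` (H, W, C)."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                 # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)      # upper bound is exclusive
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :].copy()


rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)
print(augment(image, rng, crop_size=28).shape)    # (28, 28, 3)
```

The label of the example is left unchanged, which is exactly the assumption that the transformation does not alter the underlying subject.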

---

#### Early Stopping
* **The Concept:** This technique involves stopping the training process before the model has fully converged to its lowest training error.
* **How It Works:** You monitor both the training error and the dev set (validation) error during gradient descent. The training error will likely decrease monotonically, but the dev set error will decrease for a while and then start to increase as the model begins to overfit. Early stopping stops the training at the point where the dev set error is at its minimum.
* **Why It Works:** Early stopping effectively selects a model whose weights $W$ have a smaller norm, because the weights start out small and have had less time to grow during training. This has a similar effect to L2 regularization, which also penalizes large weights.
* **Trade-offs:** Early stopping couples two separate machine learning goals: optimizing the cost function and preventing overfitting. This can make the process of tuning hyperparameters more complicated. An alternative is to use L2 regularization and train the model for a longer period, which decouples these two tasks but can be more computationally expensive (because you have to try several values of the regularization hyperparameter $\lambda$ and compare them on the dev set).
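
The stopping rule can be sketched as a function over a recorded list of dev set errors. In practice the dev error is noisy, so this example waits for `patience` epochs without improvement before stopping; `patience` and the function name are assumptions added for this sketch and are not part of the description above.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch whose parameters early stopping would keep.

    Training is considered stopped once `patience` epochs in a row fail to
    improve on the best dev error seen so far.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err   # new best: remember this epoch
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for too long
    return best_epoch


dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.25, 0.16]
print(early_stopping_epoch(dev_errors, patience=3))  # 3
```

Note that training stops before the last value (0.16) is ever seen, which illustrates the trade-off: stopping early can miss a better model that longer training would have reached.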

---

![Early stopping](images/early_stopping.png)

### Setting up your optimization problem

#### Normalizing inputs

Normalizing input features is a crucial step to speed up a neural network's training. It involves two steps:

* **Zero the Mean:** Subtract the mean of each feature from its values so the new mean is zero.
* **Normalize the Variance:** Divide each zero-mean feature by its standard deviation (the square root of its variance), so that every feature has unit variance and all features share a similar scale.

This process makes the **cost function's** contours more symmetric and spherical rather than elongated. A more balanced cost function allows the **gradient descent** algorithm to converge much faster and more efficiently by taking larger, more direct steps toward the minimum instead of many small, oscillating steps.

It's important to use the same mean and variance calculated from the **training set** to normalize the **test set** as well.
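
A minimal sketch of this procedure, assuming the examples are stored as columns of `X` (shape `(n_features, m)`). Each feature is divided by its standard deviation, the square root of its variance, which is what gives it unit variance. The function names and the small `eps` that guards against division by zero are choices made for this example.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation from the training set only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps  # eps avoids dividing by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply statistics computed on the training set to any split (train, dev, test)."""
    return (X - mu) / sigma
```

Calling `normalize(X_test, mu, sigma)` with the training-set `mu` and `sigma` ensures that the test data goes through exactly the same transformation as the training data.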

#### Vanishing/Exploding Gradients

When training deep neural networks, **vanishing and exploding gradients** are problems where the gradients—used to update the network's weights—become either extremely small or extremely large, making training difficult.

* **Exploding Gradients**: Occur when weights are slightly greater than one, causing gradients to grow exponentially with each layer.
* **Vanishing Gradients**: Occur when weights are slightly less than one, causing gradients to shrink exponentially with each layer. This is especially problematic as it slows down learning.

The primary solution is a careful choice of **random weight initialization**, which helps to prevent the gradients from becoming too large or too small, allowing the network to train more effectively.
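
To see how quickly this happens, consider a toy network with linear activations, no biases, and the same weight matrix (a scaled identity) in every layer. These simplifications, the function name, and the values 1.1 and 0.9 are assumptions chosen for this illustration.

```python
import numpy as np

def deep_linear_output(scale, depth, x):
    """Push x through `depth` identical linear layers with W = scale * I."""
    W = scale * np.eye(len(x))
    a = x
    for _ in range(depth):
        a = W @ a            # linear activation, no bias
    return a


x = np.ones(2)
print(deep_linear_output(1.1, 100, x))  # entries grow like 1.1 ** 100 (over 10,000)
print(deep_linear_output(0.9, 100, x))  # entries shrink like 0.9 ** 100 (under 0.0001)
```

The gradients flowing backward through such a network scale with depth in the same exponential way.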

---

![Vanishing Exploding Gradients](images/exploding_vanishing_gradients.png)

#### Weight Initialization for Deep Networks

* **The Goal of Weight Initialization:** The main objective is to initialize the weights so that they are neither too large nor too small. If the weights are too large, the activations and gradients can grow exponentially (**exploding gradients**). If they are too small, the activations and gradients can shrink exponentially toward zero (**vanishing gradients**), which stops the network from learning.

* **Initialization for a Single Neuron:** For a single neuron with many input features, the sum of the products of weights and inputs ($z$) can become very large. To keep $z$ on a reasonable scale, the variance of the weights should be inversely proportional to the number of input features ($n$). A common approach is to set the variance of the weights ($W$) to be $1/n$.

* **Generalizing to Deep Networks:**
    * **ReLU Activation:** For a network using the **ReLU activation function** (the most common choice), a good practice is to set the variance of the weights for a given layer to $2/n_{in}$, where $n_{in}$ is the number of input features to that layer. The weights are then initialized from a random distribution (e.g., a standard normal distribution) and scaled by $\sqrt{2/n_{in}}$.
    * **Tanh Activation:** For a network using the **Tanh activation function**, a slightly different approach, known as **Xavier initialization**, is often used. This method scales the weights by $\sqrt{1/n_{in}}$. Other variants, such as one by Bengio et al., also exist.

* **Practical Application:** In practice, using these initialization formulas provides a strong starting point for training. While the specific constants (like 1 or 2) can be treated as hyperparameters to be fine-tuned, these standard values are generally very effective at mitigating the vanishing and exploding gradient problems, allowing for <u>faster and more stable training of deep networks</u>.
   
**What you should remember**:
- The weights $W^{[l]}$ should be initialized randomly to break symmetry. 
- However, it's okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly. 
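
The initialization rules above can be sketched as follows, assuming `layer_dims` lists the layer sizes starting with the input layer and that each $W^{[l]}$ has shape `(n_out, n_in)`. The function name, the dictionary keys, and the `seed` argument are choices made for this example.

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu", seed=0):
    """Random weights scaled by sqrt(2/n_in) for ReLU or sqrt(1/n_in) for tanh; zero biases."""
    if activation not in ("relu", "tanh"):
        raise ValueError("activation must be 'relu' or 'tanh'")
    rng = np.random.default_rng(seed)
    numerator = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_in, n_out = layer_dims[l - 1], layer_dims[l]
        # Standard normal draws scaled so that Var(W) = numerator / n_in.
        params[f"W{l}"] = rng.standard_normal((n_out, n_in)) * np.sqrt(numerator / n_in)
        params[f"b{l}"] = np.zeros((n_out, 1))  # zeros are fine: W already breaks symmetry
    return params


params = initialize_parameters([1000, 500, 1], activation="relu")
print(params["W1"].var())  # close to 2 / 1000
```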

---

![Weight Initialization](images/param_initialization.png)

#### Numerical Approximation of Gradients

Gradient checking is a powerful technique for debugging and verifying the correctness of a neural network's backpropagation implementation. It relies on numerically approximating the gradient and comparing it to the gradient calculated by backpropagation.

##### **The Two-Sided Difference Method**
To numerically approximate a gradient, you use the two-sided difference formula, which provides a much more accurate estimate than the one-sided method.

* **Formula**: The approximation of the derivative of a function $f(\theta)$ is calculated as:
    $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$

    Here, $\theta$ is the parameter you're checking, and $\epsilon$ is a very small number (e.g., $0.01$).
* **Intuition**: Instead of measuring the slope on just one side of a point, this method measures the slope of a line that passes through two points equidistant from $\theta$. This "two-sided" approach effectively averages the slopes on either side, providing a much more precise approximation of the true derivative at that point. 

##### **Why It's Better**
The two-sided difference method is far more accurate than the one-sided method ($\frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$). While the one-sided error is proportional to $\epsilon$, the two-sided error is proportional to $\epsilon^2$. Since $\epsilon$ is a very small number, $\epsilon^2$ is even smaller (e.g., if $\epsilon = 0.01$, $\epsilon^2 = 0.0001$), resulting in a much more precise approximation.
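
A small numerical check of this claim, using $f(\theta) = \theta^3$ at $\theta = 1$ with $\epsilon = 0.01$, where the true derivative is $3\theta^2 = 3$. The choice of function and the helper names are assumptions made for this illustration.

```python
def one_sided(f, theta, eps):
    return (f(theta + eps) - f(theta)) / eps

def two_sided(f, theta, eps):
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)


f = lambda t: t ** 3
theta, eps = 1.0, 0.01
true_grad = 3 * theta ** 2

one_err = abs(one_sided(f, theta, eps) - true_grad)
two_err = abs(two_sided(f, theta, eps) - true_grad)
print(one_err)  # about 0.0301, on the order of eps
print(two_err)  # about 0.0001, on the order of eps ** 2
```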

##### **How it's Used**
* **Backpropagation Check**: To perform gradient checking, you first use your backpropagation implementation to compute the gradient of your cost function with respect to your model's parameters.
* **Numerical Check**: Then, you use the two-sided difference formula to numerically approximate the same gradient.
* **Comparison**: You compare the two results. If they are very close, you can be confident that your backpropagation code is correct. If there's a significant difference, it indicates a bug in your backpropagation implementation.

---

![Derivative Approximation](images/derivative_approximation.png)

#### Gradient Checking

Gradient checking is a powerful debugging tool used to verify that the implementation of **backpropagation** is correct. Since backpropagation involves complex derivative calculations, it's easy to make subtle errors. Gradient checking provides a way to numerically approximate the gradient and compare it to the one calculated by your code.

##### **The Procedure**
1.  **Reshape Parameters:** First, take all the parameters of your neural network (**$W$** and **$b$** for all layers) and reshape them into a single, giant vector called **$\theta$**.
2.  **Reshape Gradients:** Similarly, take all the gradients calculated by backpropagation (**$dW$** and **$db$**) and reshape them into a single vector called **$d\theta$**. This vector should have the same dimensions as **$\theta$**.
3.  **Numerical Approximation Loop:** For each element $\theta_i$ in the vector **$\theta$**, use the **two-sided difference** method to numerically approximate its gradient. This is done by calculating:
    $$
    \frac{J(\theta_1, ..., \theta_i+\epsilon, ...) - J(\theta_1, ..., \theta_i-\epsilon, ...)}{2\epsilon}
    $$
    This process is repeated for every element in **$\theta$**, creating a new vector of approximated gradients, **$d\theta_{approx}$**.
4.  **Compare the Gradients:** Finally, compare the numerically approximated gradient vector ($d\theta_{approx}$) to the one calculated by your backpropagation code ($d\theta$). A good way to measure the difference is to use the following formula:

    $$
    \frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}
    $$

    This formula calculates the Euclidean distance between the two vectors and normalizes it by their combined lengths.
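
The procedure can be sketched as a single function, assuming the parameters have already been flattened into a 1-D float array `theta` and that `cost_fn(theta)` returns the scalar cost $J$. The function name and the default `eps=1e-7` are assumptions made for this example.

```python
import numpy as np

def gradient_check(cost_fn, theta, d_theta, eps=1e-7):
    """Return the relative difference between backprop and numerical gradients.

    theta and d_theta are 1-D float arrays of the same length.
    """
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps                      # nudge only the i-th parameter up
        theta_minus = theta.copy()
        theta_minus[i] -= eps                     # and down
        d_theta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator


# Toy check with J(theta) = sum(theta ** 3), whose exact gradient is 3 * theta ** 2.
cost = lambda t: np.sum(t ** 3)
theta = np.array([1.0, -2.0, 0.5])
print(gradient_check(cost, theta, 3 * theta ** 2))  # very small: gradients agree
```

Each parameter requires two evaluations of the cost, so the loop is far slower than backpropagation itself.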

##### **Interpreting the Results**
* **Result is < $10^{-7}$**: This is an excellent result and strongly suggests that your backpropagation implementation is correct.
* **Result is around $10^{-5}$**: This might be acceptable, but you should double-check your code, as there could be a small bug.
* **Result is > $10^{-3}$**: This indicates a significant problem, and it's almost certain that there is a bug in your backpropagation code. You should debug your implementation until this value becomes very small.

By using gradient checking, you can quickly find and fix bugs that would otherwise be very difficult to locate, saving a great deal of time and effort. 

#### Gradient Checking Implementation

Here are some key tips for effectively implementing and using **gradient checking** to debug your neural network.

* **Use for Debugging Only:** Gradient checking is computationally very slow, so you should **only use it for debugging** your backpropagation code. After you've confirmed your backpropagation is correct, you should turn off gradient checking and rely solely on backpropagation for training.

* **Pinpoint the Bug:** If the gradient check fails, examine the individual components of the numerically approximated gradient vector and the backpropagation gradient vector. By identifying which specific parameters (e.g., a specific layer's **$W$** or **$b$** values) have a large discrepancy, you can narrow down the location of the bug in your code.

* **Include Regularization:** If your cost function includes a regularization term (e.g., L2 regularization), remember to include this term when computing the cost during gradient checking. The gradient you're checking against must also include the derivative of the regularization term.

* **Don't Use with Dropout:** Gradient checking is not compatible with **dropout** because dropout randomly deactivates neurons in each iteration. This means the cost function is not fixed but is constantly changing, making it impossible to perform a stable numerical approximation. To use gradient checking, you should temporarily turn off dropout (by setting the keep probability to 1.0) and verify your code without it. Then, you can turn dropout back on.

* **Test at Multiple Points:** It is rare, but sometimes a bug in backpropagation only appears when the parameters (**$W$** and **$b$**) have moved far from their small initial values. To be thorough, you can run a gradient check not only at the beginning of training (with random initialization) but also after training for a number of iterations.