### **Optimizers in Deep Learning**

This section of the Deep Learning course focuses on **Optimizers**, a crucial topic for improving neural network performance.

**1. Context and Importance**
*   Previous videos focused on improving neural network performance, specifically **speeding up training**.
*   Training deep neural networks can be very time-consuming, especially with many hidden layers.
*   Previous techniques to speed up training include:
    *   Weight Initialisation
    *   Batch Normalisation
    *   Choice of Activation Function
*   **Optimizers** are presented as the **fourth and arguably most important technique** for increasing the training speed of neural networks.
*   The goal is to understand how optimizers increase neural network training speed.

**2. The Role of an Optimizer in a Neural Network**
*   The primary goal of training a neural network is to **find optimal values for its weights and biases** (parameters).
*   Example: A simple neural network with 9 parameters (weights and biases).
*   When data is fed into the network, it produces a prediction.
*   The objective is to find parameter values such that the **predictions are very close to the actual (real) results**.
*   This is achieved by **minimising the "loss function"** (or "cost function"), which represents the difference between the predicted output (Y-hat) and the real output (Y).
*   **Process:**
    1.  Start with **random initial values** for weights and biases.
    2.  Gradually **improve these values** to reach the "correct" or "optimum" values.
*   **Graphical Representation (Loss Surface):**
    *   The loss function can be visualised as a multi-dimensional graph (e.g., 3D if there are two weights and one loss axis).
    *   The goal is to navigate this loss surface from a starting random point to the **global minimum**, where the loss is lowest.
    *   The weights corresponding to this global minimum are the optimum parameters for the neural network.
    *   **An optimizer is required to perform this task** of finding the minimum loss and corresponding optimum weight values.
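As a rough illustration of "minimising the loss", the sketch below evaluates a loss for several candidate weights. It assumes a mean squared error loss and a one-weight model `y_hat = w * x`. Neither comes from the lecture; the toy data and the name `mse_loss` are made up for this example.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error between real outputs (Y) and predictions (Y-hat)."""
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical toy data generated by y = 2x, so the optimum weight is 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

# Evaluate the loss for a few candidate weights of the model y_hat = w * x.
for w in (0.0, 1.0, 2.0, 3.0):
    print(f"w = {w:.1f}  loss = {mse_loss(y, w * x):.2f}")
```

The loss is lowest at `w = 2` and grows as the weight moves away from it in either direction. An optimizer has to find that lowest point without trying every value.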

**3. Gradient Descent as an Optimizer**
*   The primary optimizer used in deep learning is **Gradient Descent**.
*   **Update Rule:** Gradient Descent uses a simple rule to update weights iteratively:
    *   `new_weight = old_weight - learning_rate * gradient_of_loss_wrt_weight`
    *   The **learning rate** determines the size of the steps taken.
    *   Because each step is proportional to the gradient, Gradient Descent takes **larger steps where the slope is steep** (typically far from the minimum) and **smaller steps as the slope flattens** near the minimum.
*   **Types of Gradient Descent (already covered in previous videos, but briefly reviewed):**
    *   **Batch Gradient Descent (BGD):**
        *   Updates weights **once per epoch** (one full pass over the entire dataset).
        *   Calculates predictions and loss over the **entire dataset** (e.g., 5000 rows) before updating weights.
        *   Results in fewer weight updates per epoch.
    *   **Stochastic Gradient Descent (SGD):**
        *   Updates weights **after seeing each individual data point**.
        *   If there are 5000 data points, weights are updated 5000 times per epoch.
        *   Results in many noisy weight updates per epoch, which makes each epoch slower to compute, but the noise can help the optimizer escape local minima.
    *   **Mini-batch Gradient Descent (MBGD):**
        *   A compromise between BGD and SGD.
        *   Updates weights after processing a **small batch of data points** (e.g., batch size of 100).
        *   The dataset is divided into mini-batches, and weights are updated after each mini-batch.
*   A clear understanding of these Gradient Descent types is crucial for grasping new optimizers.
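The three variants differ only in how many rows are used per weight update. The sketch below makes that concrete with a single `batch_size` parameter. It assumes a one-weight linear model with a mean squared error loss, and the function name `train_one_epoch` and the synthetic data are hypothetical. The 5000 rows and the batch size of 100 match the figures used above.

```python
import numpy as np

def train_one_epoch(x, y, w, learning_rate, batch_size, rng):
    """Run one epoch of gradient descent on the model y_hat = w * x.

    Returns the updated weight and the number of weight updates performed.
    """
    indices = rng.permutation(len(x))      # shuffle the rows once per epoch
    updates = 0
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        y_hat = w * x[batch]
        # Gradient of the mean squared error with respect to w.
        gradient = np.mean(2.0 * (y_hat - y[batch]) * x[batch])
        # new_weight = old_weight - learning_rate * gradient
        w = w - learning_rate * gradient
        updates += 1
    return w, updates

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=5000)      # 5000 rows, as in the notes
y = 3.0 * x                                # hypothetical target: true weight is 3

for name, batch_size in [("batch", 5000), ("stochastic", 1), ("mini-batch", 100)]:
    w, updates = train_one_epoch(x, y, w=0.0, learning_rate=0.1,
                                 batch_size=batch_size, rng=rng)
    print(f"{name:>10}: {updates:4d} updates per epoch, w = {w:.3f}")
```

With 5000 rows this performs 1 update per epoch for BGD, 5000 for SGD, and 50 for MBGD with a batch size of 100.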

**4. Problems/Challenges with Conventional Gradient Descent**
Despite its utility, conventional Gradient Descent (BGD, SGD, MBGD) faces several challenges:

*   **Challenge 1: Difficulty in Setting the Learning Rate**
    *   The learning rate is a **critical hyperparameter**.
    *   **If the learning rate is too small:**
        *   **Convergence is extremely slow**, as steps are tiny.
        *   May take too long to reach the minimum, or might not reach it at all within the available training time.
    *   **If the learning rate is too large:**
        *   The optimizer **overshoots the minimum**, leading to oscillations or even divergence.
        *   Training can become **unstable**, moving further away from the minimum.
    *   Finding the optimal learning rate is a **difficult and dataset-dependent task**.
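Both failure modes show up even on the simplest possible loss. The sketch below assumes the toy loss `f(w) = w**2` (gradient `2 * w`), which is not from the lecture. The three learning rates are picked only to show one case each.

```python
def gradient_descent(learning_rate, w=5.0, steps=20):
    """Minimise f(w) = w**2 (gradient 2 * w) and return the visited weights."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
        history.append(w)
    return history

for learning_rate in (0.001, 0.1, 1.1):
    final_w = gradient_descent(learning_rate)[-1]
    print(f"learning_rate = {learning_rate:<5}  w after 20 steps = {final_w:.4f}")
```

With the smallest rate the weight has barely moved from its start at 5 after 20 steps. With 0.1 it is close to the minimum at 0. With 1.1 every step overshoots and flips the sign of `w`, so the weight moves further from the minimum on each update.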

*   **Challenge 2: Limitations of Learning Rate Scheduling**
    *   **Learning rate scheduling** was introduced to mitigate the problem of a fixed learning rate.
    *   It involves **automatically changing or reducing the learning rate** based on a predefined schedule or when certain objectives are met.
    *   **Problem:** The schedule or thresholds must be **predefined before training**.
    *   This means the scheduling is **not adaptive to the specific dataset** during training.
    *   A schedule that works well for one dataset might not perform optimally for another.
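A predefined schedule can be as small as a function of the epoch number. The sketch below assumes step decay, one common schedule that the lecture does not specifically name. The function `step_decay` and its default values are hypothetical.

```python
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs (step decay)."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

for epoch in (0, 9, 10, 25, 40):
    print(f"epoch {epoch:2d}: learning_rate = {step_decay(epoch):.5f}")
```

All three arguments (`initial_lr`, `drop`, `epochs_per_drop`) are fixed before training starts. Nothing in the function looks at the data or the loss, which is exactly the limitation described above.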

*   **Challenge 3: Using a Single Learning Rate for Multiple Parameters**
    *   In a neural network, there are often **many weight and bias parameters** (e.g., 9 parameters in the example).
    *   This implies a high-dimensional loss surface.
    *   Conventional Gradient Descent applies a **single, uniform learning rate** across **all dimensions/parameters**.
    *   **Restriction:** It **cannot have separate learning rates** for different weight parameters.
    *   This is problematic because different directions on the loss surface might require different step sizes (e.g., faster movement in one direction, slower in another) for efficient convergence.
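To see why one shared rate can hurt, consider a loss that is much steeper along one weight than the other. The sketch below assumes the toy loss `f(w1, w2) = 0.5 * (w1**2 + 100 * w2**2)`. The per-parameter rates are hand-picked for illustration and are not how any particular optimizer computes them.

```python
import numpy as np

def descend(learning_rate, steps=50):
    """Gradient descent on f(w1, w2) = 0.5 * (w1**2 + 100 * w2**2).

    `learning_rate` may be a scalar (one rate for every parameter) or an
    array holding one rate per parameter.
    """
    w = np.array([10.0, 1.0])
    curvature = np.array([1.0, 100.0])
    for _ in range(steps):
        gradient = curvature * w           # [w1, 100 * w2]
        w = w - learning_rate * gradient
    return w

# One shared rate: it must stay below 2 / 100 to keep w2 stable,
# which makes progress along w1 slow.
shared = descend(0.015)

# Hand-picked per-parameter rates (illustrative only).
separate = descend(np.array([0.5, 0.005]))

print("shared rate:   ", shared)
print("separate rates:", separate)
```

With the shared rate, `w2` converges but `w1` is still far from 0 after 50 steps. With a larger rate for the flat direction and a smaller one for the steep direction, both weights reach the minimum.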

*   **Challenge 4: Getting Stuck in Local Minima**
    *   The loss functions of complex neural networks often have **multiple local minima**, in addition to the global minimum.
    *   The ultimate goal is to reach the **global minimum** (the point of lowest possible loss).
    *   Conventional Gradient Descent algorithms can easily **get trapped in a local minimum**, depending on where the weights start.
    *   If stuck in a local minimum, the algorithm converges to a **sub-optimal solution**.
    *   Stochastic Gradient Descent (SGD) has a *slight* potential to escape local minima due to its noisy updates.
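The dependence on the starting point can be seen on a one-dimensional example. The sketch below assumes a made-up loss `w**4 - 4 * w**2 + w`, which has a local minimum near `w = 1.35` and a lower, global minimum near `w = -1.47`. The learning rate and step count are arbitrary choices.

```python
def loss(w):
    """Hypothetical 1-D loss with one local and one global minimum."""
    return w**4 - 4.0 * w**2 + w

def gradient(w):
    """Derivative of the loss above with respect to w."""
    return 4.0 * w**3 - 8.0 * w + 1.0

def gradient_descent(w, learning_rate=0.01, steps=500):
    """Plain gradient descent from the starting weight `w`."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

from_right = gradient_descent(2.0)    # starts on the right-hand slope
from_left = gradient_descent(-2.0)    # starts on the left-hand slope

print(f"start  2.0 -> w = {from_right:.3f}, loss = {loss(from_right):.3f}")
print(f"start -2.0 -> w = {from_left:.3f}, loss = {loss(from_left):.3f}")
```

Both runs stop where the gradient is zero, but the run starting at `2.0` settles in the shallower local minimum and ends with a higher loss than the run starting at `-2.0`.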

*   **Challenge 5: Encountering Saddle Points**
    *   A **saddle point** is a point on the loss surface where the gradient is zero but which is neither a minimum nor a maximum: the surface curves upward in one direction and downward in another.
    *   Saddle points are often surrounded by a flat region (plateau) where the **slope (gradient) is close to zero** in all directions.
    *   If the gradient is zero, the weight update rule will result in **no change to the weights** (`new_weight = old_weight`).
    *   This causes the algorithm to get stuck, even though it's not an optimal solution.
    *   Conventional Gradient Descent algorithms slow down dramatically near saddle points and **struggle to navigate or escape them** in a reasonable number of steps.
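A minimal sketch of this behaviour, assuming the textbook saddle `f(w1, w2) = w1**2 - w2**2` (not taken from the lecture), with an arbitrary learning rate and arbitrary starting points:

```python
import numpy as np

def gradient(w):
    """Gradient of f(w1, w2) = w1**2 - w2**2, which has a saddle at (0, 0)."""
    return np.array([2.0 * w[0], -2.0 * w[1]])

def gradient_descent(w, learning_rate=0.1, steps=100):
    """Plain gradient descent from the starting point `w`."""
    w = np.array(w, dtype=float)
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Starting exactly on the w1 axis, the updates slide into the saddle point
# and then stop, because the gradient there is zero.
stuck = gradient_descent([1.0, 0.0])

# A tiny offset in w2 does grow, but after 50 steps it is still very small.
slow = gradient_descent([1.0, 1e-8], steps=50)

print("on the axis:     ", stuck)
print("with tiny offset:", slow)
```

The loss could still be lowered by moving along `w2`, yet the first run ends at the saddle point and the second has hardly left it after 50 updates.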

**5. The Need for New Optimizers**
*   These challenges (slow training, sub-optimal results, difficulty with hyperparameter tuning) highlight the limitations of conventional Gradient Descent.
*   Therefore, **new optimizers** have been developed to address these problems and improve training efficiency and effectiveness.
*   These new optimizers are generally **improvements or modifications to the core Gradient Descent algorithm**, rather than entirely new concepts.
*   **Upcoming Optimizers to be covered:**
    *   Momentum
    *   Adagrad
    *   RMSprop
    *   Adam (most widely used)
*   To understand these new optimizers, the concept of **Exponentially Weighted Moving Average** will be covered next, as it is foundational to many of them.

***