## Lecture Notes: SGD with Momentum

### 1. Introduction to Optimisation Techniques in Deep Learning

*   Deep learning involves training neural networks, often for regression problems where inputs are used to predict outputs.
*   The performance of a neural network is evaluated using a **loss function**, which quantifies the difference between predictions and actual values (e.g., Mean Squared Error: the average of `(y - y_hat)^2` over the data points).
*   To improve network performance (reduce loss), the **weights** and **biases** of the neural network are adjusted.
*   The **loss** is a function of these weights and biases, and the goal of optimisation is to find the values for these parameters that **minimise the loss**.
*   **SGD with Momentum** is a crucial optimisation technique that **speeds up neural network training** and introduces the concept of "momentum," which more advanced optimisation techniques build on.
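
To make "loss as a function of weights and biases" concrete, here is a minimal sketch. The one-input linear model, the toy data, and the names `predict` and `mse_loss` are assumptions chosen for illustration, not part of the lecture.

```python
def predict(w, b, x):
    """Single-input linear model: y_hat = w * x + b."""
    return w * x + b

def mse_loss(w, b, xs, ys):
    """Mean of the squared errors (y - y_hat)^2 over the dataset."""
    errors = [(y - predict(w, b, x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

loss_bad = mse_loss(0.0, 0.0, xs, ys)   # parameters far from the data
loss_good = mse_loss(2.0, 1.0, xs, ys)  # parameters that fit the data exactly
print(loss_bad, loss_good)
```

The same data gives a different loss for each choice of `(w, b)`; optimisation is the search for the pair with the smallest value.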

### 2. Visualising Loss Functions

To understand optimisation, it's helpful to visualise loss functions:
*   **2D Plot**: Represents loss as a function of a **single parameter** (e.g., one weight `W`). It shows how changes in `W` affect the loss.
*   **3D Plot**: Represents loss as a function of **two parameters** (e.g., a weight `W` and a bias `B`). This creates a 3D surface. Loss surfaces over more than two parameters cannot be plotted directly.
*   **Contour Plot**: A 2D projection of a 3D loss surface, viewed from above.
    *   **Concentric circles/ellipses** on a contour plot represent points with the **same loss value (altitude)**.
    *   **Colours** are used to represent the third dimension (height/loss value), with different colours (e.g., yellow/orange for high, blue/purple for low) indicating varying altitudes.
    *   Closely spaced contours indicate a steep slope, while widely spaced contours suggest a flatter surface.
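
The two contour-plot properties above can be checked numerically. This is a small sketch on a hypothetical bowl-shaped loss `w^2 + 4 * b^2`; the function and the sample points are assumptions, not taken from the lecture.

```python
import math

def loss(w, b):
    """Hypothetical bowl-shaped loss, steeper along b than along w."""
    return w ** 2 + 4 * b ** 2

def gradient_norm(w, b):
    """Length of the gradient (2w, 8b): how steep the surface is at (w, b)."""
    return math.hypot(2 * w, 8 * b)

# Two different points that sit on the same contour line (same loss value).
on_contour_a = loss(2.0, 0.0)
on_contour_b = loss(0.0, 1.0)

# Steepness at those two points: contours are packed closer where this is larger.
steep_along_w = gradient_norm(2.0, 0.0)
steep_along_b = gradient_norm(0.0, 1.0)
print(on_contour_a, on_contour_b, steep_along_w, steep_along_b)
```

Both points have the same loss, so they lie on one ellipse, but the surface is twice as steep at the second point, which is where the contours of this ellipse-shaped plot would be drawn closer together.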

### 3. Challenges in Deep Learning Optimisation (Non-Convex Problems)

*   Complex neural networks often result in **non-convex optimisation problems**, meaning the loss function surface is intricate and not a simple bowl shape.
*   Finding the true minimum (global minimum) in non-convex landscapes is difficult due to three primary reasons:
    *   **Local Minima**: Points where the slope is zero, but it's not the lowest possible loss. Algorithms can get stuck here, leading to sub-optimal solutions.
    *   **Saddle Points**: Points where the surface rises in one direction and falls in another. Around a saddle point the gradient is close to zero over a large area, so each update is tiny, causing optimisers to **slow down significantly** and prolong training.
    *   **High Curvature**: Regions where the loss surface changes very rapidly, making it difficult for standard optimisers to navigate effectively.
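
The slowdown near a saddle point can be sketched with plain gradient descent on the textbook saddle `f(w1, w2) = w1^2 - w2^2`. The surface, starting point, and learning rate below are assumptions picked to make the effect visible.

```python
import math

def grad(w1, w2):
    """Gradient of the saddle-shaped surface f(w1, w2) = w1^2 - w2^2."""
    return 2 * w1, -2 * w2

w1, w2 = 1.0, 1e-6      # start almost exactly on the ridge leading into the saddle
lr = 0.1
steps = 0
min_grad_norm = float("inf")

# Run plain gradient descent until the point has clearly fallen off the saddle.
while w1 ** 2 - w2 ** 2 > -1.0 and steps < 1000:
    g1, g2 = grad(w1, w2)
    min_grad_norm = min(min_grad_norm, math.hypot(g1, g2))
    w1 -= lr * g1
    w2 -= lr * g2
    steps += 1

print(steps, min_grad_norm)
```

The optimiser first slides towards the saddle at the origin, where the gradient nearly vanishes, and only escapes after many small steps along the downhill direction.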

### 4. Limitations of Standard Gradient Descent

*   **Batch Gradient Descent (BGD)**: Updates weights after processing the entire dataset. It reaches the minimum smoothly but can be slow for large datasets.
*   **Stochastic Gradient Descent (SGD)**: Updates weights after seeing just one data point. Its path to the minimum is erratic and noisy but can be faster than BGD.
*   **Mini-Batch Gradient Descent**: Updates weights after a small batch of data, offering a balance between BGD and SGD.
*   These "vanilla" gradient descent algorithms struggle with the challenges of non-convex optimisation (local minima, saddle points, high curvature), leading to either **slow training** or **sub-optimal solutions**.
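
The three variants differ only in how many data points feed each update. The sketch below makes that explicit with a single `batch_size` argument; the one-weight model `y_hat = w * x`, the noise-free toy data, and the `train` helper are assumptions for illustration. Shuffling is omitted to keep the run deterministic.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.05, epochs=100):
    """Gradient descent where batch_size selects the variant."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)
            updates += 1
    return w, updates

# Toy data from y = 3x (assumed for illustration).
data = [(0.25 * i, 0.75 * i) for i in range(1, 9)]

w_batch, n_batch = train(data, batch_size=len(data))  # batch GD: 1 update per epoch
w_sgd, n_sgd = train(data, batch_size=1)              # SGD: 1 update per data point
w_mini, n_mini = train(data, batch_size=4)            # mini-batch: in between
print(n_batch, n_mini, n_sgd)
```

Because this toy data has no noise, all three runs settle near `w = 3`; on real data the single-sample updates would follow a visibly noisier path.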

### 5. Why Use Momentum Optimisation? (Problems it Solves)

Momentum is designed to navigate the complexities of non-convex loss functions by addressing:
*   **High Curvature**: Helps traverse areas where the loss changes rapidly.
*   **Consistent Gradients (Slow Slope Changes)**: Prevents getting stuck or moving slowly in areas with gradual slopes (like saddle points). It builds up speed to move past such regions.
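*   **Noisy Gradients and Local Minima**: Combining past gradients smooths out the noise in individual updates, and the accumulated "speed" helps the optimiser escape local minima by carrying it over small "ditches" in the loss landscape.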

### 6. Core Principle and Intuition of Momentum

*   The fundamental idea behind momentum optimisation is to **accelerate movement in consistent directions**.
*   **Intuition (Car Analogy)**: If multiple previous directions suggest moving towards a point `B`, you gain confidence and increase speed in that direction.
*   **Intuition (Physics Analogy)**: Imagine a ball rolling down a surface. As it descends, it gathers momentum, increasing its speed. This algorithm is named "momentum" because its mathematics mimics Newtonian physics.
*   The **single most important benefit** of using momentum optimisation is **speed**: in practice it converges faster than normal SGD in the vast majority of cases.

<img src="https://www.researchgate.net/publication/333469047/figure/fig1/AS:764105438793728@1559188341202/The-compare-of-the-SGD-algorithms-with-and-without-momentum-Take-Task-1-as-example-The.png">

### 7. Mathematical Implementation of Momentum

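Momentum is implemented by keeping an exponentially decaying running sum of past gradients, in the spirit of an **exponentially weighted moving average (EWMA)**: the newest gradient counts in full and each older one is scaled down by a further factor of `β`.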
*   **Normal SGD Weight Update**:
    *   `W_t = W_t-1 - η * ∇L(W_t-1)`
    *   Where:
        *   `W_t` is the weight at time `t`
        *   `W_t-1` is the weight at the previous time step
        *   `η` (eta) is the learning rate
        *   `∇L(W_t-1)` is the gradient of the loss function with respect to `W` at `W_t-1`

*   **SGD with Momentum Weight Update**:
    1.  **Calculate Velocity (Momentum Term)**:
        *   `V_t = β * V_t-1 + η * ∇L(W_t-1)`
        *   `V_t` is the velocity at time `t`.
        *   `V_t-1` is the velocity at the previous time step.
        *   `β` (beta) is the **decay factor** (also called momentum coefficient), typically a value between 0.7 and 0.99, often 0.9.
        *   This `V_t` term essentially accumulates the history of past gradients, providing the "momentum" or acceleration.
    2.  **Update Weights**:
        *   `W_t = W_t-1 - V_t`
        *   Instead of subtracting the current gradient, we subtract the calculated velocity `V_t`.
<img src="https://i.ibb.co/mV3yX7rx/Screenshot-2025-09-18-095905.png">
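
The two update rules above translate almost line for line into code. This is a minimal sketch on a hypothetical one-parameter loss `L(w) = 0.05 * w^2`, using the exact gradient instead of a sampled one; the loss, the starting point, and the names `sgd_step` and `momentum_step` are assumptions for illustration.

```python
def grad(w):
    """Gradient of the hypothetical loss L(w) = 0.05 * w^2 (a shallow bowl)."""
    return 0.1 * w

def sgd_step(w, lr):
    """Plain update: W_t = W_t-1 - eta * grad L(W_t-1)."""
    return w - lr * grad(w)

def momentum_step(w, v, lr, beta):
    """Momentum update: V_t = beta * V_t-1 + eta * grad L(W_t-1), then W_t = W_t-1 - V_t."""
    v = beta * v + lr * grad(w)
    return w - v, v

lr, beta = 0.1, 0.9
w_plain = 10.0
w_mom, v = 10.0, 0.0   # velocity starts at zero

for _ in range(100):
    w_plain = sgd_step(w_plain, lr)
    w_mom, v = momentum_step(w_mom, v, lr, beta)

print(abs(w_plain), abs(w_mom))
```

After the same 100 updates, the momentum run ends much closer to the minimum at `w = 0` than the plain run, because the velocity keeps growing while successive gradients point the same way.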

### 8. The Role of Beta (β) in Momentum

*   `β` is a crucial hyperparameter that dictates how much contribution past velocities have to the current update.
*   **If `β = 0`**: The `β * V_t-1` term becomes zero, and the velocity update simplifies to `V_t = η * ∇L(W_t-1)`. When plugged into the weight update, this becomes `W_t = W_t-1 - η * ∇L(W_t-1)`, which is the **formula for normal SGD**. Thus, with `β = 0`, momentum acts like normal SGD.
*   **If `β = 1`**: Past velocities never decay. Like a ball on a frictionless surface, the optimiser can keep oscillating around the minimum indefinitely without settling. This is generally not desired.
*   **Typical values for `β`** are `0.9` to `0.99`; smaller values such as `0.5` are occasionally used.
*   The `β` value determines the "weight" given to older gradients; older gradients decay exponentially, having less impact on the current velocity. A common approximation is that the current velocity mostly reflects the last `1 / (1 - β)` gradients (about 10 for `β = 0.9`).
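
Unrolling the velocity update with `V_0 = 0` gives `V_t = η * (∇L(W_t-1) + β * ∇L(W_t-2) + β^2 * ∇L(W_t-3) + ...)`, so a gradient seen `k` steps ago carries weight `β^k`. The sketch below checks the consequences numerically under a deliberately simple assumption: every gradient equals the same constant `g`. The helper `velocity_after` is hypothetical.

```python
def velocity_after(steps, beta, lr=0.1, g=1.0):
    """Velocity after `steps` updates when every gradient equals g."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + lr * g
    return v

# beta = 0: the velocity is just lr * g, i.e. the plain SGD step.
v_beta0 = velocity_after(50, beta=0.0)

# beta = 0.9: the velocity saturates near lr * g / (1 - beta), ten times larger.
v_beta09 = velocity_after(200, beta=0.9)

# A gradient seen k steps ago is weighted by beta ** k.
weights = [0.9 ** k for k in range(10)]
share_of_last_10 = sum(weights) * (1 - 0.9)  # fraction of total weight in the last 10 gradients
print(v_beta0, v_beta09, share_of_last_10)
```

With a constant gradient, `β = 0.9` scales the step by `1 / (1 - β) = 10`, and roughly two thirds of the total weight sits on the most recent 10 gradients, which is why `1 / (1 - β)` is used as a rule of thumb for the "memory" of the velocity.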

### 9. Benefits and Disadvantages of Momentum (Visualised)

*   **Benefits**:
    *   **Faster Speed**: Momentum allows the optimiser to gain speed, reaching the minimum faster than vanilla gradient descent. The "ball" accelerates as it rolls down the loss surface.
    *   **Escape Local Minima**: The accumulated momentum helps the optimiser "jump out" of shallow local minima to continue towards the deeper global minimum.
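    *   **Smoother Convergence Path**: Gradient components that flip sign from step to step (the "vertical" zig-zag across a narrow valley) largely cancel in the velocity, while components that point consistently along the valley (the "horizontal" direction) add up. The result is a more direct and faster path to convergence.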

*   **Disadvantage**:
    *   **Oscillation around the Minimum**: Due to the accumulated momentum, the algorithm can **overshoot the global minimum** and then oscillate back and forth multiple times before settling down. This "wobbling" can slow down convergence *after* reaching the vicinity of the minimum, making it not the absolute fastest optimiser in all scenarios (though still generally faster than vanilla SGD).
    *   "The biggest problem with momentum is momentum itself".
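
The overshooting behaviour can be sketched by counting how often the parameter jumps across the minimum. The loss `L(w) = 0.5 * w^2`, the hyperparameters, and the helper `count_overshoots` are assumptions for illustration; setting `beta=0.0` recovers the plain update.

```python
def count_overshoots(beta, lr=0.1, steps=100, w=5.0):
    """Count how often w jumps across the minimum of L(w) = 0.5 * w^2 at w = 0."""
    v, crossings = 0.0, 0
    for _ in range(steps):
        v = beta * v + lr * w      # the gradient of 0.5 * w^2 is w
        new_w = w - v
        if new_w * w < 0:          # sign flip: the step passed over the minimum
            crossings += 1
        w = new_w
    return crossings

overshoots_plain = count_overshoots(beta=0.0)
overshoots_momentum = count_overshoots(beta=0.9)
print(overshoots_plain, overshoots_momentum)
```

In this setup the plain update approaches the minimum from one side and never crosses it, while the momentum run swings past it repeatedly with shrinking amplitude before settling.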