## Lecture Notes: SGD with Momentum & Nesterov Accelerated Gradient (NAG)

### 5. Nesterov Accelerated Gradient (NAG)

#### 5.1. Introduction to NAG
*   **Nesterov Accelerated Gradient (NAG)** is an optimisation technique that represents a **small but significant upgrade to Momentum**.
*   In practice it **often converges faster than Momentum**, although the gain depends on the loss surface and the hyperparameters.
*   The primary goal of NAG is to **reduce the oscillations** observed in Momentum, thereby speeding up convergence.

#### 5.2. Intuition and Geometric Interpretation of NAG (Look-Ahead Gradient)
*   NAG's core idea is to be more **"intelligent"** or **"foresighted"** about where to calculate the gradient.
*   Instead of calculating the gradient at the current position (`W_t-1`), NAG **first applies the momentum term** to project to a **"look-ahead" point**.
*   The gradient is then calculated at this **projected look-ahead point**, and this look-ahead gradient is used to update the weights.
*   **Analogy**: Imagine a ball rolling down a hill. Momentum takes a step based on its current speed and the slope right where it is. NAG, however, first predicts where the momentum *alone* would take it next, then it calculates the slope *at that predicted future position*, and uses *that* slope to guide its step. This allows it to "anticipate" the terrain and adjust its path sooner, reducing overshooting.
*   This "look-ahead" mechanism helps in **damping the oscillations** that are characteristic of standard Momentum.

#### 5.3. Mathematical Implementation of NAG
NAG modifies the Momentum update by calculating the gradient at a predicted future position:
1.  **Calculate Look-Ahead Point (`W_lookahead`)**:
    *   `W_lookahead = W_t-1 - β * V_t-1`
    *   This step projects the current weights by applying only the accumulated momentum (velocity) from the previous step.
2.  **Calculate Velocity (`V_t`) using Look-Ahead Gradient**:
    *   `V_t = β * V_t-1 + η * ∇L(W_lookahead)`
    *   Notice that the gradient `∇L` is calculated at `W_lookahead` (the future position), not `W_t-1` (the current position).
3.  **Update Weights (`W_t`)**:
    *   `W_t = W_t-1 - V_t`
    *   The final weight update is still applied from the current position using the newly calculated velocity.
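
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a library implementation: the names `nag_step` and `quadratic_grad`, the toy quadratic loss, and the chosen `lr` and `beta` values are all hypothetical and picked only for this example.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def nag_step(
    w: Vector,
    v: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.05,
    beta: float = 0.9,
) -> Tuple[Vector, Vector]:
    """One NAG update following steps 1-3 of section 5.3."""
    # Step 1: look-ahead point, W_lookahead = W_t-1 - beta * V_t-1
    w_lookahead = [wi - beta * vi for wi, vi in zip(w, v)]
    # Step 2: velocity from the gradient evaluated at the look-ahead point
    g = grad_fn(w_lookahead)
    v_new = [beta * vi + lr * gi for vi, gi in zip(v, g)]
    # Step 3: update the weights from the current position, W_t = W_t-1 - V_t
    w_new = [wi - vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def quadratic_grad(w: Vector) -> Vector:
    """Gradient of the toy loss L(w) = 0.5 * (w0^2 + 10 * w1^2) (assumed for illustration)."""
    return [w[0], 10.0 * w[1]]


# Run NAG from an arbitrary starting point with zero initial velocity.
w, v = [5.0, 2.0], [0.0, 0.0]
for _ in range(200):
    w, v = nag_step(w, v, quadratic_grad, lr=0.05, beta=0.9)

print("final weights:", w)
```

Setting `w_lookahead = w` in step 1 would turn this sketch back into the standard Momentum update, which shows that the only difference between the two methods is where the gradient is evaluated.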

#### 5.4. Advantages of NAG
*   **Reduced Oscillations**: By anticipating the future gradient, NAG is better at slowing down as it approaches the minimum, significantly reducing the overshooting and oscillations seen in Momentum.
*   **Faster Convergence**: Due to reduced oscillations, NAG can converge faster to the minimum than Momentum, especially on complex loss surfaces.
*   **Drop-in Change**: NAG differs from Momentum only in where the gradient is evaluated, so it uses the same hyperparameters (`η` and `β`).
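
The damping effect can be checked on a toy problem. The sketch below runs Momentum and NAG on the one-dimensional loss `L(w) = 0.5 * w^2` and measures how far each method overshoots the minimum at `w = 0`. The function name `run`, the loss, and the values `lr=0.1` and `beta=0.9` are assumptions chosen for illustration; behaviour on real loss surfaces may differ.

```python
from typing import List


def run(use_nesterov: bool, steps: int = 100, lr: float = 0.1,
        beta: float = 0.9, w0: float = 1.0) -> List[float]:
    """Return the weight trajectory on L(w) = 0.5 * w^2, whose gradient is w."""
    w, v = w0, 0.0
    trajectory = [w]
    for _ in range(steps):
        # NAG evaluates the gradient at the look-ahead point; Momentum at w itself.
        point = w - beta * v if use_nesterov else w
        grad = point  # dL/dw for the toy loss
        v = beta * v + lr * grad
        w = w - v
        trajectory.append(w)
    return trajectory


momentum_path = run(use_nesterov=False)
nag_path = run(use_nesterov=True)

# Starting from w0 > 0, any negative weight means the method went past the minimum.
momentum_overshoot = -min(momentum_path)
nag_overshoot = -min(nag_path)

print("Momentum overshoot:", momentum_overshoot)
print("NAG overshoot:     ", nag_overshoot)
```

With these settings both methods overshoot, but NAG's largest excursion past the minimum is smaller than Momentum's, which matches the "reduced oscillations" point above.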

#### 5.5. Disadvantage of NAG
*   **Risk of Getting Stuck in Local Minima**: The damping of oscillations, while generally beneficial, can be a disadvantage in certain scenarios. If the loss landscape has shallow local minima, NAG's damped velocity might not carry it far enough to "jump over" these minima, potentially causing it to get stuck. Momentum, with its larger overshoot, might retain enough velocity to escape such points. In such cases, other optimisers might be more suitable.

### 6. Implementation in Keras

Both SGD with Momentum and NAG can be easily implemented in Keras using the `SGD` class.
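```python
import tensorflow as tf

# momentum=0.0 with nesterov=False is plain SGD; change these two arguments as described below.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD")
```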

*   **For standard SGD**:
    *   Set `momentum = 0`.
    *   Set `nesterov = False`.

*   **For SGD with Momentum**:
    *   Provide a value for `momentum` (e.g., `0.9`).
    *   Set `nesterov = False`.

*   **For Nesterov Accelerated Gradient (NAG)**:
    *   Provide a value for `momentum` (e.g., `0.9`).
    *   Set `nesterov = True`.
