## **Optimizers in Deep Learning: Adam (Adaptive Moment Estimation)**

### **1. Definition and Significance**

*   **Adam** stands for **Adaptive Moment Estimation**.
*   It is one of the most widely used optimization techniques in deep learning and is often treated as the default optimizer.
*   It is a common choice for training Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).

### **2. Core Principles: Combining Two Key Ideas**

Adam's strength comes from borrowing and combining successful concepts from preceding optimizers. The algorithm is based on combining two main ideas:

1.  **Momentum:** The concept of using past gradients for the current update to accelerate descent.
2.  **Learning Rate Decay/Adaptive Scaling:** The concept, taken from AdaGrad and RMSProp, of scaling the learning rate separately for each parameter, which helps when the data is sparse.

**Historical Context of Predecessors:**

*   Standard techniques like Batch Gradient Descent are often too slow.
*   **Momentum** improved speed but caused oscillations around the minimum.
*   **NAG (Nesterov Accelerated Gradient)** evaluates the gradient at a look-ahead position, which dampens these oscillations.
*   **AdaGrad** was developed to handle sparse data/columns, allowing direct descent on stretched loss contours.
*   However, AdaGrad accumulates the sum of all past squared gradients, so its effective learning rate decays too rapidly. Updates become negligible and the optimizer can stall before reaching the minimum.
*   **RMSProp** addressed this by replacing the sum with an exponentially weighted average, which keeps the effective learning rate from shrinking towards zero so that training continues to make progress.
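
The difference between the two accumulators can be sketched numerically. The snippet below is a minimal illustration, not a full optimizer: the function name `effective_step_sizes`, the decay factor `rho`, and the constant gradient of 1 are all assumptions chosen for the example.

```python
import numpy as np

def effective_step_sizes(grads, lr=0.1, rho=0.9, eps=1e-8):
    """Return the per-step effective learning rate for AdaGrad and RMSProp."""
    adagrad_acc = 0.0  # running SUM of squared gradients (only ever grows)
    rmsprop_acc = 0.0  # exponentially weighted AVERAGE of squared gradients
    adagrad_lrs, rmsprop_lrs = [], []
    for g in grads:
        adagrad_acc += g ** 2
        rmsprop_acc = rho * rmsprop_acc + (1 - rho) * g ** 2
        adagrad_lrs.append(lr / (np.sqrt(adagrad_acc) + eps))
        rmsprop_lrs.append(lr / (np.sqrt(rmsprop_acc) + eps))
    return np.array(adagrad_lrs), np.array(rmsprop_lrs)

# Hypothetical input: 1000 steps with a constant gradient of 1.
grads = np.ones(1000)
adagrad_lrs, rmsprop_lrs = effective_step_sizes(grads)

print("AdaGrad, first vs last step:", adagrad_lrs[0], adagrad_lrs[-1])
print("RMSProp, first vs last step:", rmsprop_lrs[0], rmsprop_lrs[-1])
```

With a constant gradient, the AdaGrad step size keeps shrinking in proportion to $1/\sqrt{T}$, while the RMSProp step size settles near the base learning rate.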

By combining the **Momentum** idea with the adaptive per-parameter **Learning Rate Decay** of RMSProp, Adam forms a hybrid of the two approaches.

### **3. Mathematical Formulation**

The full Adam algorithm involves calculating estimates of the first moment (the mean of the gradients, $M_T$) and the second moment (the mean of the squared gradients, i.e. their uncentred variance, $V_T$).

#### **A. Calculating the Moments**

Adam uses an Exponentially Weighted Average (similar to RMSProp) to calculate $M_T$ and $V_T$, meaning more weight is given to recent gradients.

*   **First Moment ($M_T$):** This term handles the **Momentum** component. Here $g_T$ is the gradient of the loss with respect to the weight vector at step $T$.
    $$M_T = \beta_1 \times M_{T-1} + (1 - \beta_1) \times g_T$$
*   **Second Moment ($V_T$):** This term handles the **Adaptive Learning Rate** component, similar to how $V_T$ is used in RMSProp and AdaGrad. The square is taken element-wise.
    $$V_T = \beta_2 \times V_{T-1} + (1 - \beta_2) \times g_T^2$$
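
The two updates above can be sketched in a few lines of NumPy. The function name `update_moments` and the example gradient are assumptions made for illustration; the default beta values are those from the original Adam paper.

```python
import numpy as np

def update_moments(m_prev, v_prev, grad, beta1=0.9, beta2=0.999):
    """One exponentially weighted update of the first and second moments."""
    m = beta1 * m_prev + (1 - beta1) * grad       # first moment (momentum term)
    v = beta2 * v_prev + (1 - beta2) * grad ** 2  # second moment (element-wise square)
    return m, v

# Hypothetical gradient for a two-parameter model, starting from M_0 = V_0 = 0.
grad = np.array([1.0, -2.0])
m, v = update_moments(np.zeros(2), np.zeros(2), grad)
print(m, v)
```

After this first step both moments are much smaller in magnitude than the gradient itself, because they started at zero.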

#### **B. Recommended Hyperparameters**

*   The general value for $\beta_1$ is **0.9**.
*   The general value for $\beta_2$ is **0.999**.
*   These beta values are configurable parameters that can be changed.
*   The learning rate $\alpha$ still has to be set (a common default is **0.001**), but it often needs less manual tuning than in plain gradient descent, because Adam rescales the step size for each parameter automatically.

#### **C. Bias Correction**

*   Bias correction is applied after calculating $M_T$ and $V_T$.
*   **Purpose:** $M_T$ and $V_T$ both start at zero ($M_0 = 0, V_0 = 0$). This pulls both estimates towards zero during the first steps of optimisation, so they are rescaled to offset that bias.
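
For reference, the correction used in the original Adam paper divides each moment by a factor that depends on the step number $T$. Here $\beta_1^T$ and $\beta_2^T$ mean the beta values raised to the power $T$:

$$\hat{M}_T = \frac{M_T}{1 - \beta_1^T}, \qquad \hat{V}_T = \frac{V_T}{1 - \beta_2^T}$$

As a worked case, write the first gradient as $g_1$ (a shorthand assumed here). With $\beta_1 = 0.9$ and $M_0 = 0$, the first update gives $M_1 = 0.1 \times g_1$, which is ten times too small. The correction restores it:

$$\hat{M}_1 = \frac{0.1 \times g_1}{1 - 0.9^1} = g_1$$

As $T$ grows, $\beta_1^T$ and $\beta_2^T$ approach zero, so the correction fades out and $\hat{M}_T \approx M_T$, $\hat{V}_T \approx V_T$.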

#### **D. Weight Update Formula**

The final weight vector update formula includes the learning rate ($\alpha$), the bias-corrected moment terms ($\hat{M}_T$ and $\hat{V}_T$), and a small constant ($\epsilon$):
$$W_{T+1} = W_T - \alpha \times \frac{\hat{M}_T}{\sqrt{\hat{V}_T} + \epsilon}$$
*(Note: $\alpha$ is the learning rate, and $\epsilon$ is a small constant, typically around $10^{-8}$, that prevents division by zero.)*
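
Putting the pieces together, a single Adam step might look like the sketch below. This is a minimal NumPy illustration under stated assumptions, not a reference implementation: the function name `adam_step`, the toy loss $f(W) = \sum W^2$, the starting weights, and the learning rate of 0.01 are all chosen for the example, and the bias correction follows the form given in the original Adam paper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step number."""
    m = beta1 * m + (1 - beta1) * grad        # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (assumed for illustration): minimise f(w) = sum(w^2), gradient 2w.
w = np.array([1.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)

print(w)  # both weights should end up close to the minimum at 0
```

Note that on the very first step the update is roughly $\alpha$ in magnitude for every parameter regardless of gradient size, since $\hat{M}_1 / \sqrt{\hat{V}_1}$ reduces to the sign of the gradient.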

### **4. Performance and Best Practices**

*   When optimizer trajectories are animated on a loss surface, Adam's path shows the characteristics of both **Momentum** and **Learning Rate Decay**.
*   Adam tends to perform well on **complex neural networks**, where the problem is one of **non-convex optimization**.
*   It often converges faster than its predecessors.
*   **Starting Point:** Adam is recommended as a **good starting point** for deep learning projects.
*   **Alternative Optimizers:** If Adam does not yield satisfactory results, the next logical choices for testing are usually **RMSProp** or, occasionally, **Momentum**.
*   There is no single "clear answer" for which optimizer is universally best; results depend on the specific data, necessitating hyperparameter tuning and testing.

***