## **Optimizers in Deep Learning: RMSProp**

### **1. Introduction and Definition**

*   **RMSProp** stands for **Root Mean Square Propagation**.
*   It is an optimization technique that serves as an **improvement over AdaGrad** (Adaptive Gradient Algorithm).

### **2. Motivation: The Problem with AdaGrad**

RMSProp addresses a major disadvantage of the AdaGrad algorithm:

*   **Use Case Context:** AdaGrad is typically used when dealing with **sparse data** (data where some columns contain many zero values). Sparse data results in an **elongated contour plot** of the loss function.
*   **AdaGrad's Initial Role:** Optimizers such as plain Gradient Descent or Momentum struggle with these stretched contours, leading to slow, inefficient movements. AdaGrad was introduced to solve this specific problem.
*   **AdaGrad's Failure to Converge:** Despite its initial success, AdaGrad's primary disadvantage is that it **keeps shrinking the effective learning rate (LR)** as training proceeds. After a certain point, the learning rate becomes so small that the resulting updates are negligible.
*   **Result of Small Updates:** Because the updates become extremely small, AdaGrad may stall before it reaches the global minimum.

### **3. Mathematical Basis of AdaGrad's Failure**

The root cause of AdaGrad's convergence failure is how it calculates the divisor used in the update rule:

*   **Learning Rate Division:** The learning rate for a parameter (e.g., $B$) is divided by a term built from $V_T$, typically $\sqrt{V_T + \epsilon}$, where $\epsilon$ is a small constant that avoids division by zero.
*   **$V_T$ Calculation:** In AdaGrad, $V_T$ is calculated as the **sum of the squares of all past gradients**. This includes every gradient from the start of training up to the current epoch (more precisely, up to the current update step).
*   **The Issue:** Because only non-negative terms are ever added, $V_T$ never decreases, and over many epochs it becomes **very large**. When the learning rate is divided by such a large term, the resulting update becomes **almost negligible**, which halts movement in that direction.
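
The accumulation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `adagrad_effective_lr`, the constant gradient of 1.0, the learning rate, and the small stabilizing constant `eps` are all assumptions chosen for the example. The sketch divides the learning rate by the square root of the accumulated sum plus `eps`, as AdaGrad implementations commonly do.

```python
import math

def adagrad_effective_lr(gradients, lr=0.01, eps=1e-7):
    """Return lr / sqrt(v_t + eps) after each step for a single parameter."""
    v = 0.0
    history = []
    for g in gradients:
        v += g * g  # AdaGrad: add every squared gradient and never forget any
        history.append(lr / math.sqrt(v + eps))
    return history

# Hypothetical input: a constant gradient of 1.0 for 10,000 steps.
rates = adagrad_effective_lr([1.0] * 10_000)
print(rates[0], rates[99], rates[-1])
```

With a constant gradient, the accumulated sum grows linearly with the step count, so the effective learning rate falls roughly as one over the square root of the number of steps and never recovers.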

### **4. RMSProp Mechanism and Mathematics**

RMSProp's core change is designed to prevent $V_T$ from growing excessively large.

*   **The Key Change:** RMSProp stops considering the entire history of gradients. Instead, it uses an **Exponentially Weighted Average** (or Exponentially Decaying Average) when calculating $V_T$.
*   **Formula for $V_T$ (RMSProp):**
    $$V_T = \beta \times V_{T-1} + (1 - \beta) \times (\text{Gradient})^2$$
*   **Beta ($\beta$) Value:** $\beta$ is commonly set to **0.9** (Keras exposes it as `rho`, with a default of 0.9); values such as **0.95** are also used, and it can be tuned.
*   **Impact of Exponential Averaging:**
    1.  Recent gradients receive **more weight**, while older gradients are gradually "forgotten" (their influence is heavily down-weighted).
    2.  At every new step, each earlier contribution is multiplied by $\beta < 1$ once more, so after $k$ further steps the squared gradient from epoch 1 has been scaled by $\beta^{k}$ and contributes far less than the most recent gradient.
    3.  $V_T$ is thereby prevented from "shooting up": it stays on the order of the recent squared gradients instead of growing without bound.
    4.  Since $V_T$ remains reasonably sized, the effective learning rate **does not shrink toward zero**, and updates continue, allowing the algorithm to keep making progress toward a minimum.
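
The effect of the exponentially weighted average can be checked with a small plain-Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the helper names `rmsprop_effective_lr` and `gradient_weights`, the constant gradient of 1.0, the value `beta=0.9`, and the stabilizing constant `eps` are all chosen for the example.

```python
import math

def rmsprop_effective_lr(gradients, lr=0.001, beta=0.9, eps=1e-7):
    """Return lr / sqrt(v_t + eps) after each step for a single parameter."""
    v = 0.0
    history = []
    for g in gradients:
        # RMSProp: exponentially weighted average of squared gradients
        v = beta * v + (1.0 - beta) * g * g
        history.append(lr / math.sqrt(v + eps))
    return history

def gradient_weights(t, beta=0.9):
    """Weight that v_t places on the squared gradient from each step 1..t."""
    return [(1.0 - beta) * beta ** (t - k) for k in range(1, t + 1)]

# Hypothetical input: a constant gradient of 1.0 for 10,000 steps.
rates = rmsprop_effective_lr([1.0] * 10_000)
weights = gradient_weights(50)
print(rates[0], rates[-1])      # effective rate at the first and last step
print(weights[0], weights[-1])  # weight on the oldest vs. the newest gradient
```

With a constant gradient, the running average settles near the squared gradient itself, so the effective rate levels off around `lr` instead of decaying toward zero. The weights list shows the "forgetting": after 50 steps the first gradient carries only a tiny fraction of the weight given to the latest one.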

### **5. Performance and Disadvantages**

*   **Convex vs. Non-Convex Problems:**
    *   In **convex optimization problems** (e.g., Linear Regression), AdaGrad and RMSProp behave similarly, and AdaGrad can also reach the global minimum.
    *   In **non-convex optimization problems** involving complex neural networks, AdaGrad tends to stall before reaching a good minimum, whereas **RMSProp usually keeps making progress and converges**.
*   **Empirical Success:** In practice, RMSProp has proven to be **one of the most effective optimization techniques** for neural networks.
*   **Historical Context:** Before the introduction of Adam, RMSProp was the optimizer most commonly used for neural networks.
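*   **Current Status:** RMSProp still competes with Adam today and is worth trying when Adam does not produce satisfactory results.
*   **Disadvantages:** No major disadvantages are usually cited for RMSProp, although the learning rate and $\beta$ still need tuning like any other hyperparameters.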

---

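```python
from tensorflow import keras
from tensorflow.keras.optimizers import RMSprop, Adagrad

def build_model():
    # Placeholder classifier; 20 input features and 3 classes are illustrative assumptions.
    return keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(3, activation='softmax'),
    ])

# RMSprop: rho is the decay rate (beta) of the running average of squared gradients.
rmsprop_model = build_model()
rmsprop_model.compile(
    optimizer=RMSprop(learning_rate=0.001, rho=0.9),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Adagrad on a separate model, so this compile does not overwrite the RMSprop setup.
adagrad_model = build_model()
adagrad_model.compile(
    optimizer=Adagrad(learning_rate=0.01, epsilon=1e-07),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
```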