## AdaGrad: Adaptive Gradient Algorithm

### 1. Introduction to AdaGrad
*   **AdaGrad** stands for **Adaptive Gradient**.
*   Its main principle is to **not keep the learning rate fixed** but to adjust it during training based on the gradients seen so far. This means AdaGrad uses an **adaptive learning rate**.

### 2. When AdaGrad Performs Well
AdaGrad tends to outperform other optimisation algorithms in two scenarios:
*   **Input features with highly different scales**. For example, a dataset containing CGPA (0-10) and salary (0 to £200,000+). In such cases, however, the data is usually normalised beforehand.
*   **Sparse input features**. This is considered a more useful application area for AdaGrad.
    *   **Sparse data** means that most values in the input features are zero.
    *   **Example**: A dataset with IQ, CGPA, and an "IIT" column (0 or 1, depending on whether the person is from IIT) to predict a package. Since very few people in India attend IIT, the "IIT" column would predominantly contain zero values, making it a sparse feature. This "IIT" column is important because IIT graduates often have higher packages.

### 3. The "Long-Elongated Valley Problem" with Sparse Data
*   **Problem with Sparse Features**: Sparse features lead to a phenomenon called the **long-elongated valley problem** in the loss landscape.
    *   **Loss Graph Visualisation**:
        *   For normal input data (non-sparse), the loss contour plot (loss with respect to parameters like W and B) tends to be **circular**.
        *   For sparse input data, the loss contour plot becomes **highly elongated**: it is stretched along one direction (e.g., the `W` axis for the weight) while staying compact along the other (e.g., the `B` axis for the bias).
    *   **3D Representation**: In a 3D plot of the loss surface for sparse data, the loss changes steeply along one axis but stays almost flat along the other.
*   **Failure of Standard Optimisers (Batch Gradient Descent and Momentum)**:
    *   In an elongated valley, standard optimisation algorithms like **Batch Gradient Descent (BGD)**, **Momentum**, and **Nesterov Accelerated Gradient (NAG)** do not perform well.
    *   **Batch Gradient Descent (BGD)**: It wastes time by moving predominantly in one direction first (e.g., the `B` (bias) direction) and only then moving slowly in the other direction (e.g., the `W` (weight) direction) towards the minimum. It does not take the most direct path.
    *   **Momentum**: It moves rapidly and can even **overshoot** in one direction before eventually returning and moving towards the minimum. This is also not the most efficient path.
    *   **Why does this happen with sparse data?**
        *   Consider a simple neural network with inputs `x` (sparse) and `1` (for bias), weights `W`, and bias `B`.
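        *   The **gradient calculation** for each parameter involves multiplying that parameter's input by the error `(y_pred - y_true)` (this is the form the gradient takes for a squared-error loss, up to a constant factor).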
        *   For a sparse feature (like `x`), its value is **zero many times**.
        *   When `x` is zero, the term `(y_pred - y_true) * x` will be zero.
        *   Over many data rows (e.g., 100 rows), when summing these derivatives, the presence of many zeros from the sparse feature `x` results in a **very small cumulative derivative** for `W`. Consequently, the **weight update for sparse features will be very small**.
        *   In contrast, for the bias term `B`, the input is always `1` (non-sparse). Therefore, the derivative `(y_pred - y_true) * 1` will **never be zero** due to the input. This leads to a **large or normal update** for `B`.
        *   **Result**: With large updates in one direction (bias) and small updates in the other (weights for sparse features), the model tends to move mostly in the direction of the non-sparse feature, leading to the inefficient, elongated path.
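
The difference in gradient size can be checked with a small sketch. The data here is hypothetical (5 non-zero rows out of 100, and made-up target values), the loss is assumed to be mean squared error with the constant factor dropped, and both parameters start at zero.

```python
import numpy as np

n_rows = 100

# Hypothetical sparse feature: only 5 of 100 rows are non-zero.
x = np.zeros(n_rows)
x[:5] = 1.0

# Hypothetical target: a base value plus a bonus when x == 1.
y_true = 3.0 + 5.0 * x

# Start from W = 0 and B = 0, so every prediction is 0.
W, B = 0.0, 0.0
y_pred = W * x + B
error = y_pred - y_true

# Squared-error gradients (constant factor dropped for readability).
grad_W = np.mean(error * x)    # only rows with x != 0 contribute
grad_B = np.mean(error * 1.0)  # every row contributes

print(f"|grad_W| = {abs(grad_W):.3f}")  # 0.400
print(f"|grad_B| = {abs(grad_B):.3f}")  # 3.250
```

With the same learning rate for both parameters, the bias would move about eight times further than the weight on this step, which is the imbalance described above.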

### 4. AdaGrad's Solution: Adaptive Learning Rates
*   **The Core Idea**: AdaGrad addresses this by using **different learning rates for different parameters** (weights and biases).
*   **Intuition**:
    *   The problem arises from the gradients. The solution lies in manipulating the learning rate.
    *   The goal is to provide **comparable updates** for all parameters.
    *   If a parameter's gradient is **small** (due to sparsity), AdaGrad makes its effective learning rate **larger**.
    *   If a parameter's gradient is **large** (normal behaviour), AdaGrad makes its effective learning rate **smaller**.

### 5. AdaGrad Mathematical Formula
The update rule for a parameter `W` at time step `t+1` is:
**`Wt+1 = Wt - (Learning Rate / sqrt(Vt + epsilon)) * Gradient(Wt)`**

Where:
*   `Wt`: Current value of the parameter (weight or bias).
*   `Wt+1`: Updated value of the parameter.
*   `Learning Rate`: The initial global learning rate.
*   `Gradient(Wt)`: The gradient of the loss function with respect to the parameter `W` at time `t`.
*   **`epsilon`**: A **very small number** (e.g., 1e-8) added to the denominator to prevent division by zero if `Vt` is zero. It has no other significant importance.
*   **`Vt`**: Represents the **sum of squared past gradients** for that specific parameter.
    *   It is calculated iteratively: **`Vt = Vt-1 + (Gradient(Wt))^2`**.
    *   Squaring makes every contribution positive, so gradients of opposite sign cannot cancel out and `Vt` tracks only the magnitude of past gradients.
*   **How `Vt` works**:
    *   If a parameter has a **large gradient**, `Vt` (the sum of past squared gradients) will **grow quickly and become a large number**.
    *   Dividing the global `Learning Rate` by `sqrt(Vt)` will make the **effective learning rate small**, thereby reducing the update for that parameter.
    *   If a parameter has a **small gradient** (e.g., due to sparsity), `Vt` will **grow slowly and remain a small number**.
    *   Dividing the global `Learning Rate` by `sqrt(Vt)` will result in a **relatively larger effective learning rate**, allowing for a larger update for that parameter.
    *   This mechanism ensures that **updates for all parameters become more comparable**, leading to a more direct path to the minimum.
*   **Visual Outcome**: When AdaGrad is applied to sparse data, it shows **clear improvement** and approaches the minimum along a more direct path than Batch Gradient Descent or Momentum.
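
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy data, the helper names `gradients` and `adagrad_step`, and the learning rate of 0.1 are all assumptions made for the example, and the loss is assumed to be mean squared error.

```python
import numpy as np

# Hypothetical toy data: sparse feature x (5 of 100 rows non-zero).
x = np.zeros(100)
x[:5] = 1.0
y_true = 3.0 + 5.0 * x


def gradients(params):
    """Squared-error gradients for [W, B] (constant factor dropped)."""
    W, B = params
    error = W * x + B - y_true
    return np.array([np.mean(error * x), np.mean(error)])


def adagrad_step(params, v, grads, lr=0.1, eps=1e-8):
    """One AdaGrad update following the formula above."""
    v = v + grads ** 2                               # Vt = Vt-1 + Gradient^2
    params = params - lr / np.sqrt(v + eps) * grads  # per-parameter scaling
    return params, v


start = np.zeros(2)
grads = gradients(start)

# Plain gradient descent: step size is proportional to the raw gradient.
gd_params = start - 0.1 * grads

# AdaGrad: each parameter's step is scaled by its own accumulated gradient.
ada_params, v = adagrad_step(start, np.zeros(2), grads)

print("gradient descent step [W, B]:", gd_params)
print("AdaGrad step          [W, B]:", ada_params)
```

On the first step the accumulator holds only the current squared gradient, so AdaGrad moves both parameters by roughly the learning rate, whereas plain gradient descent moves the bias much further than the weight. On later steps the scaling depends on the full gradient history of each parameter.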

### 6. Disadvantages of AdaGrad
*   AdaGrad has a significant drawback which limits its use, especially in complex neural networks.
*   **Failure to Converge**: AdaGrad can reach close to the solution but **often fails to fully converge to the global minimum**.
*   **Reason**: As training progresses and the number of epochs increases, `Vt` (the sum of all past squared gradients) **continuously accumulates and becomes very large**.
    *   This ever-increasing `Vt` causes the **effective learning rate `(Learning Rate / sqrt(Vt + epsilon))` to become extremely small**.
    *   Consequently, the **parameter updates become negligible**, effectively stopping the learning process before the global minimum is reached.
*   **Usage**: Due to this disadvantage, AdaGrad is typically **not used for complex neural networks**. It can be suitable for simpler problems like **linear regression**.
*   **Importance**: Despite its limitations, the adaptive learning rate concept introduced by AdaGrad is **fundamental** and serves as a basis for more advanced optimisers like **RMSProp and Adam**.
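
The shrinking effective learning rate behind this drawback can be seen in a short sketch. It assumes, purely for illustration, a gradient that stays constant at 1.0 and a global learning rate of 0.1; real gradients vary, but the accumulator can only grow.

```python
import numpy as np

lr, eps = 0.1, 1e-8
g = 1.0   # hypothetical gradient, held constant for illustration
v = 0.0   # accumulated sum of squared gradients

effective = {}
for t in range(1, 10001):
    v += g ** 2  # Vt never decreases
    if t in (1, 100, 10000):
        effective[t] = lr / np.sqrt(v + eps)

for t, rate in effective.items():
    print(f"step {t:>5}: effective learning rate = {rate:.4f}")
```

Under these assumptions the effective learning rate falls from about 0.1 to about 0.001 after 10,000 steps, even though the gradient itself has not shrunk at all.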