### **Exponentially Weighted Moving Average (EWMA)**

This section introduces the concept of **Exponentially Weighted Moving Average (EWMA)**, a crucial technique for understanding advanced optimizers in deep learning.

**1. Introduction to EWMA**
*   EWMA is an important concept that needs to be understood *before* delving into specific optimizers.
*   It is a technique used to **find trends in time-series based data**.
*   Examples of time-series data include daily stock fluctuations or daily temperature records for a city.
*   The goal is to extract the underlying **pattern or trend** from such data.
*   Plotted against the raw data, the EWMA appears as a smoother curve that tracks the data, whereas a single overall average is just a horizontal line.

**2. Applications of EWMA**
*   **Time-series forecasting**.
*   **Company financial forecasting**.
*   **Signal processing** in science.
*   **Deep Learning:** It is used while building some **optimizers**. Specifically, it is foundational for optimization techniques like **Momentum**.

**3. Key Principles of EWMA Calculation**
EWMA adheres to two main principles when calculating the average:
1.  **Later points are given more weight:** When calculating EWMA, **more recent data points are weighted more heavily** compared to older data points. For instance, a data point from Day 5 will have more weight than a data point from Day 3.
2.  **Weight decreases over time:** The weight of any given data point **shrinks as time passes**. Each new time step multiplies it by `β`, so the weight decays exponentially.

**4. Mathematical Formula of EWMA**
The formula for Exponentially Weighted Moving Average at a particular time `t` (`Vt`) is:
`Vt = β * Vt-1 + (1 - β) * Tt`

Where:
*   `Vt`: The EWMA at time `t`.
*   `β (Beta)`: A constant value between 0 and 1 (0 < β < 1).
*   `Vt-1`: The EWMA from the previous time step (`t-1`).
*   `Tt`: The actual data point (e.g., temperature) at time `t`.

**5. Step-by-Step Calculation Example**
Let's assume `V0 = 0` (though some sources might directly set `V0 = T0`) and a common `β` value of **0.9** (often used in deep learning):

*   **V0**: `V0 = 0` (the assumed starting value).
*   **V1**: `V1 = β * V0 + (1 - β) * T1`
    *   If `β = 0.9` and `T1 = 30` (from example data):
    *   `V1 = 0.9 * 0 + (1 - 0.9) * 30 = 0.1 * 30 = 3`.
*   **V2**: `V2 = β * V1 + (1 - β) * T2`
    *   If `β = 0.9`, `V1 = 3`, and `T2 = 17` (from example data):
    *   `V2 = 0.9 * 3 + 0.1 * 17 = 2.7 + 1.7 = 4.4`.
*   This process continues for all subsequent data points (`V3, V4`, etc.), connecting the calculated `Vt` values to form the EWMA graph.
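
The hand calculation above can be checked with a short loop. This is a minimal sketch, assuming `V0 = 0` and `β = 0.9` as in the example; the function name `ewma_series` is a hypothetical helper, and only `T1 = 30` and `T2 = 17` are taken from the example data.

```python
def ewma_series(data, beta=0.9, v0=0.0):
    """Return [V1, V2, ...] using Vt = beta * Vt-1 + (1 - beta) * Tt."""
    values = []
    v = v0  # assumed starting value V0
    for t in data:
        v = beta * v + (1 - beta) * t
        values.append(v)
    return values

temps = [30, 17]  # T1 and T2 from the worked example
v1, v2 = ewma_series(temps)
print(round(v1, 6), round(v2, 6))  # 3.0 4.4
```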

**6. Impact of the `β` (Beta) Parameter**
The value of `β` significantly affects the behavior and smoothness of the EWMA curve:

*   **Intuition for `β`:** EWMA can be seen as an average of approximately `1 / (1 - β)` previous days' data.
    *   If `β = 0.9`: It acts like an average of `1 / (1 - 0.9) = 1 / 0.1 = 10` days.
    *   If `β = 0.5`: It acts like an average of `1 / (1 - 0.5) = 1 / 0.5 = 2` days.
*   **High `β` value (e.g., 0.98):**
    *   Means you are giving **more weight to the accumulated history** (`Vt-1` is multiplied by a larger `β`).
    *   Results in a **smoother graph** that reacts slowly to current fluctuations, because it incorporates more historical data. Think of it as a "stable" personality.
*   **Low `β` value (e.g., 0.1 or 0.5):**
    *   Means you are giving **more weight to the current point** (`Tt` is multiplied by a larger `1 - β`).
    *   Results in a **spiky graph** that closely follows the current data, making it less smooth and more oscillatory. Think of it as a "moody" personality.
*   **Sweet Spot for Deep Learning:** A common `β` value used in optimization algorithms in deep learning is **0.9**.
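
To get a feel for these effects, the sketch below runs the same recursion with two `β` values on a small made-up zigzag series. The data and the helper names (`ewma_series`, `effective_window`) are illustrative assumptions, not part of the original example. It reports the approximate window `1 / (1 - β)` and the spread of the last few EWMA values, which serves as a rough proxy for how spiky the curve is.

```python
def ewma_series(data, beta, v0=0.0):
    """Return [V1, V2, ...] using Vt = beta * Vt-1 + (1 - beta) * Tt."""
    values, v = [], v0
    for x in data:
        v = beta * v + (1 - beta) * x
        values.append(v)
    return values

def effective_window(beta):
    """Approximate number of recent points being averaged: 1 / (1 - beta)."""
    return 1 / (1 - beta)

# Made-up zigzag series alternating between 10 and 20 (illustrative only).
data = [10, 20] * 100

spreads = {}
for beta in (0.5, 0.9):
    tail = ewma_series(data, beta)[-10:]   # last 10 values, well past the warm-up
    spreads[beta] = max(tail) - min(tail)  # size of the remaining oscillation
    print(f"beta={beta}: window ~ {effective_window(beta):.0f}, spread = {spreads[beta]:.3f}")
```

On this series the spread for `β = 0.9` comes out several times smaller than for `β = 0.5`, which matches the "smoother versus spikier" description above.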

**7. Mathematical Proof of Weighting**
The formula `Vt = β * Vt-1 + (1 - β) * Tt` inherently gives more weight to recent data and less to older data.
By repeatedly substituting `Vt-1`, `Vt-2`, etc., into the formula, we can expand `Vt` in terms of current and past data points (`Tt`, `Tt-1`, `Tt-2`, ...):
*   `V1 = (1 - β)T1` (assuming V0=0)
*   `V2 = β(1 - β)T1 + (1 - β)T2`
*   `V3 = β²(1 - β)T1 + β(1 - β)T2 + (1 - β)T3`
*   `V4 = β³(1 - β)T1 + β²(1 - β)T2 + β(1 - β)T3 + (1 - β)T4`

Notice the coefficients for `T1`, `T2`, `T3`, `T4`:
*   `T1` (oldest) is multiplied by `β³(1 - β)`
*   `T2` is multiplied by `β²(1 - β)`
*   `T3` is multiplied by `β(1 - β)`
*   `T4` (most recent) is multiplied by `(1 - β)`

Since `β` is between 0 and 1, `β³ < β² < β < 1`. This demonstrates mathematically that **older data points (`T1`) are multiplied by smaller coefficients (closer to zero) than newer data points (`T4`), confirming that newer points receive higher weights**.
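
The same pattern holds for any `t`: with `V0 = 0`, the data point `Tk` receives the weight `β^(t-k) * (1 - β)`. The sketch below builds these weights for `t = 4`, checks that they grow toward the most recent point, and confirms that the weighted sum matches the recursive formula. The helper names and the sample values are hypothetical, chosen only for illustration.

```python
def ewma_weights(t, beta):
    """Weights of T1..Tt in the expansion of Vt, assuming V0 = 0."""
    return [(beta ** (t - k)) * (1 - beta) for k in range(1, t + 1)]

def ewma_recursive(data, beta):
    """Vt computed step by step with Vt = beta * Vt-1 + (1 - beta) * Tt."""
    v = 0.0
    for x in data:
        v = beta * v + (1 - beta) * x
    return v

beta = 0.9
data = [30, 17, 25, 22]  # hypothetical T1..T4
weights = ewma_weights(len(data), beta)
expanded = sum(w * x for w, x in zip(weights, data))
recursive = ewma_recursive(data, beta)

print([round(w, 4) for w in weights])     # [0.0729, 0.081, 0.09, 0.1]
print(abs(expanded - recursive) < 1e-12)  # True
```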

**8. EWMA in Python (Pandas)**
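*   The `pandas` library in Python provides an `ewm()` method on Series and DataFrame objects to calculate EWMA.
*   The decay is commonly specified through the `alpha` parameter (`ewm()` also accepts `com`, `span`, or `halflife` instead).
*   **`alpha` is equivalent to `(1 - β)`**. So, if `α = 0.1`, then `β = 0.9`.
*   **Syntax:** `dataframe_column.ewm(alpha=alpha_value).mean()`.
*   By default (`adjust=True`), pandas normalizes the weights, so the result differs from the recursive formula above. Passing `adjust=False` applies the recursion directly, with the first value initialized to the first data point (`V0 = T0`).
*   The call returns one EWMA value per row, which can be assigned to a new column of the original DataFrame and plotted.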
*   It is recommended to practice implementing EWMA from scratch without using the built-in function for better understanding.
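
As a sketch of how the pandas call relates to the formula in this section, the example below compares `ewm(alpha=..., adjust=False)` with a manual loop that starts from the first data point. It assumes pandas is installed; the column name `temp` and all values other than 30 and 17 are made up for illustration.

```python
import pandas as pd

# Hypothetical temperature readings (only 30 and 17 come from the example above).
df = pd.DataFrame({"temp": [30, 17, 25, 22]})

beta = 0.9
alpha = 1 - beta  # pandas uses alpha = 1 - beta

# adjust=False applies Vt = (1 - alpha) * Vt-1 + alpha * Tt, starting from the first value.
df["ewma"] = df["temp"].ewm(alpha=alpha, adjust=False).mean()

# Manual version of the same recursion, initialized with the first data point.
v = float(df["temp"].iloc[0])
manual = [v]
for x in df["temp"].iloc[1:]:
    v = beta * v + (1 - beta) * x
    manual.append(v)

max_diff = max(abs(a - b) for a, b in zip(df["ewma"], manual))
print(df)
print(max_diff < 1e-9)  # True
```

To reproduce the `V0 = 0` convention used in the worked example instead, one option is to run the manual loop with `v = 0.0` as the starting value.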

***

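```python
beta = 0.5

def ewma(lst, i):
    """Return Vi for the data in lst, assuming V0 = 0 (lst[i - 1] holds Ti)."""
    if i == 0:
        return 0  # base case: V0 = 0
    # Vi = beta * V(i-1) + (1 - beta) * Ti
    return beta * ewma(lst, i - 1) + (1 - beta) * lst[i - 1]

lst = [1, 2, 3, 4, 5]
print(ewma(lst, 5))  # 4.03125
```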