## I. Introduction to Backpropagation Through Time (BPTT)

*   **Learning in Neural Networks:** In standard neural networks, learning (the adjustment of weights) occurs through **Backpropagation**.
*   **RNN Specificity:** In Recurrent Neural Networks (RNNs), this concept is extended and referred to as **Backpropagation Through Time**.
*   **Context:** BPTT is studied here as part of a continuing look at how RNNs handle sequential data.

***

## II. Data Setup and RNN Architecture

### A. Example Scenario: Sentiment Analysis

*   We use a practical example of **Sentiment Analysis**, where the input is text, and the output determines the sentiment (e.g., 1 for positive, 0 for negative).
*   **Toy Dataset:** The input data consists of short reviews, each composed of three words.
    *   Example reviews include: "Cat Cat Mat," "Rat Rat Mat".
    *   Each review has an associated sentiment label (e.g., Positive or Negative).

### B. Data Representation

*   **Vector Conversion:** To process text, unique words in the vocabulary must be converted into numerical vectors.
*   If the vocabulary contains three unique words (Cat, Mat, Rat), each word is represented by a three-dimensional vector, for example a one-hot vector (e.g., Cat = $[1, 0, 0]$, Mat = $[0, 1, 0]$, Rat = $[0, 0, 1]$).
*   The input to the RNN is therefore 3-dimensional.
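
The conversion can be sketched in a few lines of plain Python. This assumes one-hot encoding and an alphabetical vocabulary order; the helper names `one_hot` and `encode_review` are hypothetical and exist only for illustration.

```python
# Vocabulary order is an assumption; any fixed ordering works.
vocab = ["cat", "mat", "rat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_index[word.lower()]] = 1
    return vec

def encode_review(review):
    """Turn a space-separated review into a list of one-hot vectors."""
    return [one_hot(w) for w in review.split()]

encoded = encode_review("Cat Cat Mat")
print(encoded)  # [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
```

Each review therefore becomes a sequence of three 3-dimensional vectors, one per word.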

### C. The RNN Cell and Weights

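*   The recurrent layer is made up of units (or cells); at each time step a cell combines the current input with the hidden state from the previous step.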
*   Since the example is a binary classification problem, the output layer has a single node, which produces the prediction ($\hat{Y}$).
*   **Weights (Parameters):** The model contains shared weights across all time steps:
    *   $W_I$ (Input Weight).
    *   $W_H$ (Hidden Weight).
    *   $W_{Output}$ (Output Weight).
    *   *Note:* Biases are generally present but are often ignored temporarily in the discussion for simplicity.
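
To make the shapes concrete, here is a minimal NumPy sketch of the three weight matrices. The hidden size of 2 is a hypothetical choice (the text does not fix the number of units), and the small random initialisation is likewise an assumption. Inputs are treated as row vectors so that the products match the form $X W_I$ used in the equations.

```python
import numpy as np

input_dim = 3    # vocabulary size from the toy example
hidden_dim = 2   # hypothetical; the text does not specify the number of units

rng = np.random.default_rng(0)
W_I = rng.normal(scale=0.1, size=(input_dim, hidden_dim))    # input -> hidden
W_H = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden -> hidden
W_Output = rng.normal(scale=0.1, size=(hidden_dim, 1))       # hidden -> single output node

# The same three matrices are reused at every time step,
# so the parameter count does not grow with review length.
n_params = W_I.size + W_H.size + W_Output.size
print(W_I.shape, W_H.shape, W_Output.shape, n_params)
```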

***

## III. Forward Propagation (Unfolding in Time)

To understand backpropagation, one must first trace the **forward propagation** path.

*   **Sequential Processing:** Reviews are inserted into the RNN **word by word**.
*   **Unfolding:** For a review with three words, the Recurrent Neural Network is **unfolded in time** into three steps.

1.  **Step 1 (First Word, $X_{11}$):** The first word is fed into the network together with an initial hidden state $O_0$ (often all zeros). This produces the first hidden state, $O_1$.
    *   *Equation:* $O_1 = f(X_{11} W_I + O_0 W_H)$, where $f$ is an activation function.

2.  **Step 2 (Second Word, $X_{12}$):** The second word is sent in. It uses $O_1$ as its recurrent input. This calculates $O_2$.
    *   *Equation:* $O_2 = f(X_{12} W_I + O_1 W_H)$.

3.  **Step 3 (Third Word, $X_{13}$):** The third word is sent in. It uses $O_2$ as its recurrent input. This calculates $O_3$.
    *   *Equation:* $O_3 = f(X_{13} W_I + O_2 W_H)$.

4.  **Final Output ($\hat{Y}$):** Since there are no more words, $O_3$ is used to calculate the final prediction ($\hat{Y}$).
    *   *Equation:* $\hat{Y} = g(O_3 W_{Output})$, where $g$ is typically a Sigmoid activation function for binary classification.
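
The four steps above can be sketched as a single loop in NumPy. This is an illustration under stated assumptions, not a definitive implementation: $f$ is taken to be `tanh`, $g$ the sigmoid, the hidden size is 2, and words are one-hot row vectors. The function name `forward` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xs, W_I, W_H, W_Output):
    """Run the unfolded RNN over one review.

    xs: array of shape (T, input_dim), one row per word.
    Returns the hidden states [O_0, ..., O_T] and the prediction y_hat.
    """
    states = [np.zeros(W_H.shape[0])]            # O_0: all zeros
    for x_t in xs:                               # one iteration per word
        # O_t = f(X_t W_I + O_{t-1} W_H), with f = tanh (assumed)
        states.append(np.tanh(x_t @ W_I + states[-1] @ W_H))
    y_hat = sigmoid(states[-1] @ W_Output)[0]    # only the last state feeds the output
    return states, y_hat

xs = np.eye(3)[[0, 0, 1]]                        # "Cat Cat Mat" as one-hot rows (assumed encoding)
rng = np.random.default_rng(0)
W_I = rng.normal(scale=0.5, size=(3, 2))
W_H = rng.normal(scale=0.5, size=(2, 2))
W_Output = rng.normal(scale=0.5, size=(2, 1))

states, y_hat = forward(xs, W_I, W_H, W_Output)
print(len(states), y_hat)                        # 4 states (O_0..O_3) and a value in (0, 1)
```

Note that the same `W_I` and `W_H` are used in every loop iteration, which is what makes the backward pass more involved than in a feed-forward network.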

### Loss Calculation

*   The calculated prediction ($\hat{Y}$) is compared with the true label ($Y$) to determine the **Loss ($L$)**.
*   For binary classification, the loss is typically calculated using binary cross-entropy:
    *   $L = -Y \log(\hat{Y}) - (1 - Y) \log(1 - \hat{Y})$.
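
A minimal Python sketch of this loss follows. The clipping constant `eps` is an addition for numerical safety (it avoids `log(0)`) and is not part of the formula above.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """L = -y*log(y_hat) - (1 - y)*log(1 - y_hat), with y_hat clipped away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)

confident_right = binary_cross_entropy(1, 0.9)   # small loss
confident_wrong = binary_cross_entropy(1, 0.1)   # large loss
print(confident_right, confident_wrong)
```

The loss is small when the prediction is close to the true label and grows quickly as the prediction moves towards the wrong class.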

***

## IV. Backpropagation and Gradient Descent

The primary goal of backpropagation is to **minimise the loss** by finding optimal values for the weights ($W_I$, $W_H$, $W_{Output}$). This involves applying the gradient descent update step.

### A. The Gradient Descent Update

Weights are updated iteratively using the learning rate and the calculated partial derivatives (gradients):
$$
W_{new} = W_{old} - \text{Learning Rate} \times \frac{\partial L}{\partial W}
$$
The core task is calculating the three required partial derivatives.
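
To see the update rule in isolation, the sketch below applies it to a stand-in quadratic loss $L(w) = (w - 3)^2$ rather than the RNN loss; the function name `gradient_descent_step`, the learning rate, and the number of iterations are illustrative assumptions.

```python
def gradient_descent_step(w_old, grad, learning_rate):
    """W_new = W_old - learning_rate * dL/dW."""
    return w_old - learning_rate * grad

# Stand-in loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for _ in range(50):
    grad = 2.0 * (w - 3.0)
    w = gradient_descent_step(w, grad, learning_rate=0.1)

print(w)  # approaches the minimiser w = 3
```

For the RNN, the same rule is applied to each of $W_I$, $W_H$, and $W_{Output}$ with its own gradient.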

### B. Calculating $\frac{\partial L}{\partial W_{Output}}$

*   This derivative is the easiest to calculate because $W_{Output}$ is near the end of the forward propagation path.
*   The relationship is: $L \rightarrow \hat{Y} \rightarrow W_{Output}$.
*   Using the Chain Rule:
    $$
    \frac{\partial L}{\partial W_{Output}} = \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial W_{Output}}
    $$
*   This involves simple differentiation of the Loss equation and the final output activation function.
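
As a worked instance, assume $g$ is the sigmoid and $L$ is the binary cross-entropy given earlier (one common pairing, not the only possible one), and treat $O_3$ as a row vector. The two factors are then:

$$
\frac{\partial L}{\partial \hat{Y}} = \frac{\hat{Y} - Y}{\hat{Y}(1 - \hat{Y})}, \qquad
\frac{\partial \hat{Y}}{\partial W_{Output}} = \hat{Y}(1 - \hat{Y}) \, O_3^{\top}
$$

The $\hat{Y}(1 - \hat{Y})$ terms cancel, leaving:

$$
\frac{\partial L}{\partial W_{Output}} = (\hat{Y} - Y) \, O_3^{\top}
$$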

### C. Calculating Derivatives for Recurrent Weights ($\frac{\partial L}{\partial W_I}$ and $\frac{\partial L}{\partial W_H}$)

Calculating the derivatives for the Input ($W_I$) and Hidden ($W_H$) weights is significantly more complex because **these weights are reused across every time step**.

#### 1. The Challenge of Unfolding

*   Because the RNN is **unfolded in time**, the relationship between the final Loss ($L$) and a recurrent weight ($W_I$ or $W_H$) involves **multiple separate computational paths**.
*   For a 3-word review, there are three distinct paths from $W_I$ to $L$ (one path for each time step where $W_I$ was used: $T=1, T=2, T=3$).
*   If the review had 10 words, there would be 10 separate paths, making manual calculation difficult.

#### 2. Calculating $\frac{\partial L}{\partial W_I}$ (Input Weight)

Since $W_I$ contributes to the loss at every time step, the final derivative is the **sum of the derivatives across all time steps** ($k$).

*   The relationship path is $L \rightarrow \hat{Y} \rightarrow O_k \rightarrow W_I$.
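*   The derivative must be calculated by summing the contributions from each time step ($k=1$ to $N$, where $N$ is the total number of time steps):
    $$
    \frac{\partial L}{\partial W_I} = \sum_{k=1}^N \left( \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial O_k} \times \frac{\partial O_k}{\partial W_I} \right)
    $$
*   Here $\frac{\partial O_k}{\partial W_I}$ is the direct dependence of $O_k$ on $W_I$ (with $O_{k-1}$ held fixed), and $\frac{\partial \hat{Y}}{\partial O_k}$ is itself a chain through all later hidden states, $O_N \rightarrow O_{N-1} \rightarrow \dots \rightarrow O_k$.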
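
For the three-word example ($N = 3$), the sum can be written out path by path. In this expansion each $\frac{\partial O_k}{\partial W_I}$ denotes only the direct use of $W_I$ at step $k$, and the factors $\frac{\partial O_3}{\partial O_2}$ and $\frac{\partial O_2}{\partial O_1}$ carry the gradient backwards through time:

$$
\frac{\partial L}{\partial W_I} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial O_3} \left( \frac{\partial O_3}{\partial W_I} + \frac{\partial O_3}{\partial O_2} \frac{\partial O_2}{\partial W_I} + \frac{\partial O_3}{\partial O_2} \frac{\partial O_2}{\partial O_1} \frac{\partial O_1}{\partial W_I} \right)
$$

The three terms in the bracket correspond to the paths through $T=3$, $T=2$, and $T=1$ respectively.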

#### 3. Calculating $\frac{\partial L}{\partial W_H}$ (Hidden Weight)

*   The calculation for $W_H$ is similarly complex.
*   $W_H$ is involved in the hidden state calculations ($O_k$) at every step.
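*   The derivative is again a sum over all time steps, with one path per use of $W_H$: $W_H$ affects $O_3$ directly, and also indirectly through $O_2$ and $O_1$, which were themselves computed with $W_H$:
    $$
    \frac{\partial L}{\partial W_H} = \sum_{k=1}^N \left( \frac{\partial L}{\partial \hat{Y}} \times \frac{\partial \hat{Y}}{\partial O_k} \times \frac{\partial O_k}{\partial W_H} \right)
    $$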
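
Summing the per-step contributions for both $W_I$ and $W_H$ can be sketched in NumPy as a single backward loop. This is a sketch under assumptions, not the only way to organise the computation: $f$ is `tanh`, $g$ is the sigmoid, the loss is binary cross-entropy, the hidden size is 2, and the names `loss_and_grads` and `numerical_grad` are hypothetical. The finite-difference check at the end is included only to confirm that the analytic gradients are consistent with the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(xs, y, W_I, W_H, W_Output):
    # Forward pass, keeping every hidden state O_0..O_T.
    T = len(xs)
    O = [np.zeros(W_H.shape[0])]
    for x_t in xs:
        O.append(np.tanh(x_t @ W_I + O[-1] @ W_H))
    y_hat = sigmoid(O[-1] @ W_Output)[0]
    loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    # Output layer: sigmoid + cross-entropy gives dL/dz = y_hat - y.
    dz = y_hat - y
    dW_Output = O[-1][:, None] * dz
    dO = dz * W_Output[:, 0]                 # dL/dO_T

    # Walk backwards through time, accumulating one term per step.
    dW_I = np.zeros_like(W_I)
    dW_H = np.zeros_like(W_H)
    for t in range(T, 0, -1):
        da = dO * (1.0 - O[t] ** 2)          # back through tanh at step t
        dW_I += np.outer(xs[t - 1], da)      # direct use of W_I at step t
        dW_H += np.outer(O[t - 1], da)       # direct use of W_H at step t
        dO = da @ W_H.T                      # pass gradient on to O_{t-1}
    return loss, dW_I, dW_H, dW_Output

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences of the scalar f() with respect to W (modified in place)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W_I = rng.normal(scale=0.5, size=(3, 2))
W_H = rng.normal(scale=0.5, size=(2, 2))
W_Output = rng.normal(scale=0.5, size=(2, 1))
xs, y = np.eye(3)[[0, 0, 1]], 1              # "Cat Cat Mat", label assumed positive

loss, dW_I, dW_H, dW_Output = loss_and_grads(xs, y, W_I, W_H, W_Output)
f = lambda: loss_and_grads(xs, y, W_I, W_H, W_Output)[0]
print(np.allclose(dW_I, numerical_grad(f, W_I), atol=1e-6),
      np.allclose(dW_H, numerical_grad(f, W_H), atol=1e-6))
```

The loop makes the "multiple paths" explicit: each iteration adds the contribution of one time step, and the line updating `dO` is what links step $t$ to step $t-1$. Because $O_0$ is all zeros here, the $T=1$ step adds nothing to `dW_H`.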

### D. Iterative Learning

Once all three derivatives ($\frac{\partial L}{\partial W_{Output}}$, $\frac{\partial L}{\partial W_I}$, and $\frac{\partial L}{\partial W_H}$) are successfully calculated:

1.  The gradient descent update step is applied, yielding new values for $W_I$, $W_H$, and $W_{Output}$.
2.  These new values are then used when processing the next piece of data (the second review, for example).
3.  This cycle of forward propagation, loss calculation, and backpropagation is repeated over the entire dataset until the loss stops decreasing (or a chosen number of passes is reached).
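
The full cycle can be sketched as a short training loop. Everything specific here is an assumption made for illustration: the sentiment labels attached to the two toy reviews, `tanh` as $f$, a hidden size of 2, a learning rate of 0.1, 500 passes over the data, and the function name `train_step`. Weights are updated after every review, as described in the steps above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(xs, y, params, lr):
    """Forward pass, BPTT, and one gradient descent update (weights changed in place)."""
    W_I, W_H, W_Output = params
    O = [np.zeros(W_H.shape[0])]
    for x_t in xs:
        O.append(np.tanh(x_t @ W_I + O[-1] @ W_H))
    y_hat = sigmoid(O[-1] @ W_Output)[0]
    loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    dz = y_hat - y
    grads = [np.zeros_like(W_I), np.zeros_like(W_H), O[-1][:, None] * dz]
    dO = dz * W_Output[:, 0]
    for t in range(len(xs), 0, -1):
        da = dO * (1.0 - O[t] ** 2)
        grads[0] += np.outer(xs[t - 1], da)
        grads[1] += np.outer(O[t - 1], da)
        dO = da @ W_H.T

    for W, dW in zip(params, grads):
        W -= lr * dW                          # W_new = W_old - lr * dL/dW
    return loss

one_hot = np.eye(3)                           # rows: cat, mat, rat (assumed order)
dataset = [
    (one_hot[[0, 0, 1]], 1),                  # "Cat Cat Mat" -> positive (hypothetical label)
    (one_hot[[2, 2, 1]], 0),                  # "Rat Rat Mat" -> negative (hypothetical label)
]

rng = np.random.default_rng(0)
params = [rng.normal(scale=0.5, size=s) for s in [(3, 2), (2, 2), (2, 1)]]

history = []
for epoch in range(500):
    epoch_loss = sum(train_step(xs, y, params, lr=0.1) for xs, y in dataset)
    history.append(epoch_loss / len(dataset))

print(history[0], history[-1])                # average loss in the first and last pass
```

Comparing the first and last entries of `history` shows whether the average loss has gone down over training.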