# LSTM Architecture and Mechanism

## Core Concepts and Architecture Overview

The complexity of the LSTM architecture, when compared to a standard RNN architecture, stems from its ability to maintain and manage **two types of context**:

1.  **Long-Term Context/Memory:** Called the **Cell State** ($C_t$).
2.  **Short-Term Context/Memory:** Called the **Hidden State** ($H_t$).

The complex design primarily serves to set up communication and interaction between the Long-Term and Short-Term memory components.

### Inputs and Outputs

For a particular time step (e.g., $t_1$), the LSTM cell takes **three inputs**:

1.  **Previous Cell State** ($C_{t-1}$).
2.  **Previous Hidden State** ($H_{t-1}$).
3.  **Input for the current time step** ($X_t$).

The LSTM cell produces **two outputs**:

1.  **Current Cell State** ($C_t$).
2.  **Current Hidden State** ($H_t$).

### Processing Goals

The processing part of the LSTM cell has two main jobs:

1.  **Updating the Cell State** ($C_{t-1}$ to $C_t$). This update involves deciding what existing information to **remove** (if unnecessary) and what **new information to add** (if necessary), both based on the current input and the previous Hidden State.
2.  **Calculating the Hidden State** ($H_t$), which is then passed to the next time step.

## Mathematical Components and Operations

### Vectors and Dimensions

The core states and inputs within the LSTM are represented mathematically as **vectors** (collections of numbers).

*   **Cell State ($C_t$) and Hidden State ($H_t$):** Both are vectors of the same dimension, and that dimension does not change across time steps: $C_{t-1}$, $C_t$, $H_{t-1}$, and $H_t$ all have exactly the same shape.
*   **Input ($X_t$):** The input is also a vector, for example a word converted into numbers (e.g., via text vectorisation). Its dimension may differ from that of the state vectors.
*   **Gate Outputs ($F_t, I_t, \tilde{C}_t, O_t$):** The outputs of the various gates (Forget, Input, Candidate Cell State, Output) are also vectors. All six vectors ($C_t, H_t, F_t, I_t, \tilde{C}_t, O_t$) have the exact same shape or number of dimensions.
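The shape constraints above also fix the size of each layer's weights. As a rough sketch, assume every layer in the cell has the form `activation(W · [H_{t-1}, X_t] + B)` used in the gate equations of these notes, with `units` nodes per layer. Then each weight matrix has `units` rows and `units + input_dim` columns. The function name and the example sizes here are illustrative only.

```python
def lstm_param_count(units, input_dim):
    """Count trainable values in the four layers (forget, input, candidate, output).

    Assumes each layer has a weight matrix of shape (units, units + input_dim),
    because it multiplies the concatenation [H_{t-1}, X_t], plus one bias per unit.
    """
    weights_per_layer = units * (units + input_dim)
    biases_per_layer = units
    return 4 * (weights_per_layer + biases_per_layer)


# Illustrative sizes: 3-dimensional states, 2-dimensional input.
total = lstm_param_count(units=3, input_dim=2)
print(total)  # 4 * (3 * 5 + 3) = 72
```

Changing `units` changes the shape of all six state and gate vectors at once, which is why the same number has to be used throughout the cell.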

### Mathematical Operations (Red and Yellow Units)

In the architecture diagrams, the red units represent point-wise mathematical operations and the yellow units represent Neural Network layers.

*   **Point-wise Multiplication ($\odot$):** Multiplies corresponding elements of two vectors of the same shape, resulting in a new vector.
*   **Point-wise Addition ($+$):** Adds corresponding elements of two vectors of the same shape.
*   **Activation Functions (Yellow Units):** These units represent Neural Network layers containing activation functions (Sigmoid $\sigma$ or Tanh $\tanh$).
    *   **Tanh ($\tanh$):** Squashes values between -1 and 1.
    *   **Sigmoid ($\sigma$):** Squashes values between 0 and 1.
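A minimal sketch of these four operations, using plain Python lists so that it runs without any third-party library (in practice an array library would be used). The helper names are made up for this illustration.

```python
import math


def pointwise_mul(a, b):
    """Multiply corresponding elements of two vectors of the same shape."""
    assert len(a) == len(b), "vectors must have the same shape"
    return [x * y for x, y in zip(a, b)]


def pointwise_add(a, b):
    """Add corresponding elements of two vectors of the same shape."""
    assert len(a) == len(b), "vectors must have the same shape"
    return [x + y for x, y in zip(a, b)]


def sigmoid(v):
    """Squash every element into the open interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    """Squash every element into the open interval (-1, 1)."""
    return [math.tanh(x) for x in v]


a = [1.0, 2.0, 3.0]
b = [0.0, 0.5, 1.0]
product = pointwise_mul(a, b)   # [0.0, 1.0, 3.0]
total = pointwise_add(a, b)     # [1.0, 2.5, 4.0]
squashed_sigmoid = sigmoid([-10.0, 0.0, 10.0])
squashed_tanh = tanh([-10.0, 0.0, 10.0])
```

Every operation returns a vector of the same length as its inputs, which is what lets the results be chained together inside the cell.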

The number of nodes or units in these Neural Network layers is a hyperparameter, but every layer in the cell must use the same number, and it must equal the dimension of the state vectors, because the layer outputs are combined point-wise with $C_{t-1}$ and $C_t$.

## The Three Gates

The core function of the LSTM cell is governed by three primary gates: the Forget Gate, the Input Gate, and the Output Gate.

### 1. The Forget Gate ($F_t$)

<img src="https://i.ibb.co/mCqFrf5Z/image.png">

The Forget Gate determines **what to remove** from the previous Cell State ($C_{t-1}$).

*   **Inputs:** Current Input ($X_t$) and Previous Hidden State ($H_{t-1}$).
*   **Mechanism:** This gate uses a Neural Network layer with a **Sigmoid** activation function. The output, $F_t$, is a vector with values between 0 and 1.
*   **Equation (Calculation of $F_t$):**
    $$F_t = \sigma(W_f \cdot [H_{t-1}, X_t] + B_f)$$
    (Where $W_f$ is the weight matrix, $[H_{t-1}, X_t]$ is the concatenation of the vectors, and $B_f$ is the bias).
*   **Application (Forgetting):** $F_t$ is then used in a **point-wise multiplication** with the Previous Cell State ($C_{t-1}$).
    $$F_t \odot C_{t-1}$$
*   **Significance:** The term "gate" is used because $F_t$ has the power to decide how much information from $C_{t-1}$ flows forward.
    *   If $F_t$ produces 0 for a specific dimension, 100% of that information is removed (forgotten).
    *   If $F_t$ produces 1, 100% of that information passes through.
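The forget step can be sketched as follows. The sizes, weights, and biases are invented for illustration: the large biases are chosen so that the gate is almost fully open for the first dimension and almost fully closed for the second.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(W, v, b, activation):
    """One layer: activation(W . v + b), with W stored as a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]


# Hypothetical sizes: 2 units, 1 input feature.
h_prev = [0.5, -0.5]   # H_{t-1}
x_t = [1.0]            # X_t
c_prev = [2.0, -3.0]   # C_{t-1}

concat = h_prev + x_t  # [H_{t-1}, X_t], length = units + input_dim

# Made-up parameters (not trained values).
W_f = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
b_f = [10.0, -10.0]

f_t = dense(W_f, concat, b_f, sigmoid)          # F_t, each value in (0, 1)
kept = [f * c for f, c in zip(f_t, c_prev)]     # F_t ⊙ C_{t-1}
print(f_t, kept)
```

With these numbers the first dimension of $C_{t-1}$ survives almost unchanged while the second is scaled to nearly zero. A sigmoid only approaches 0 and 1, so real gates remove or keep "almost all" rather than exactly 100%.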

### 2. The Input Gate ($I_t$ and $\tilde{C}_t$)

<img src="https://i.ibb.co/d0nCL2JX/image.png">

The Input Gate decides **what new important information to add** to the Cell State. This process involves two main stages.

**Stage 1: Candidate Cell State ($\tilde{C}_t$)**

*   This stage calculates new candidate values that might be worthy of adding to the Cell State.
*   **Mechanism:** It uses a Neural Network layer with a **Tanh** activation function.
*   **Equation:**
    $$\tilde{C}_t = \tanh(W_c \cdot [H_{t-1}, X_t] + B_c)$$
    ($\tilde{C}_t$ is the Candidate Cell State).

**Stage 2: Input Filter ($I_t$)**

*   This stage determines which parts of the Candidate Cell State ($\tilde{C}_t$) are actually important enough to be added.
*   **Mechanism:** It uses a Neural Network layer with a **Sigmoid** activation function.
*   **Equation:**
    $$I_t = \sigma(W_i \cdot [H_{t-1}, X_t] + B_i)$$
    ($I_t$ is the Input Filter).
*   **Application (Filtering):** The filtered candidate information is calculated via **point-wise multiplication**:
    $$I_t \odot \tilde{C}_t$$
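Both stages can be sketched together. The weights below are arbitrary illustrative numbers, not values from any trained model, and the sizes (2 units, 1 input feature) are assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(W, v, b, activation):
    """One layer: activation(W . v + b), with W stored as a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]


h_prev = [0.5, -0.5]    # H_{t-1}
x_t = [1.0]             # X_t
concat = h_prev + x_t   # [H_{t-1}, X_t]

# Made-up parameters for the two layers of the input gate.
W_c = [[0.1, 0.2, 0.3], [-0.3, 0.2, -0.1]]
b_c = [0.0, 0.0]
W_i = [[0.5, -0.5, 0.5], [0.2, 0.2, 0.2]]
b_i = [0.0, 0.0]

c_tilde = dense(W_c, concat, b_c, math.tanh)         # Stage 1: candidate values in (-1, 1)
i_t = dense(W_i, concat, b_i, sigmoid)               # Stage 2: filter values in (0, 1)
filtered = [i * g for i, g in zip(i_t, c_tilde)]     # I_t ⊙ C~_t
print(c_tilde, i_t, filtered)
```

Because every value of $I_t$ lies between 0 and 1, the filter can only shrink a candidate value, never enlarge it or flip its sign.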

### 3. Cell State Update


The results from the Forget Gate and the Input Gate are combined using **point-wise addition** to generate the new Cell State ($C_t$).

*   **Final Cell State Equation:**
    $$C_t = (F_t \odot C_{t-1}) + (I_t \odot \tilde{C}_t)$$

This additive structure matters because information can pass along the Cell State path nearly unchanged (when $F_t \approx 1$ and $I_t \odot \tilde{C}_t \approx 0$). Gradients flowing back along this path are scaled element-wise by $F_t$ rather than repeatedly multiplied by a weight matrix, which mitigates the **vanishing gradient problem** often encountered in standard RNNs and helps preserve long-term memory.
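A small worked example of the update equation. The gate values are idealised to exactly 0 and 1 so the arithmetic is easy to follow; a real sigmoid only approaches those extremes. The function name is made up for this sketch.

```python
def update_cell_state(f_t, c_prev, i_t, c_tilde):
    """C_t = (F_t ⊙ C_{t-1}) + (I_t ⊙ C~_t), computed element by element."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]


c_prev = [2.0, -3.0]
c_tilde = [0.9, 0.5]

# Dimension 0: keep the old memory and add nothing.
# Dimension 1: erase the old memory and write the candidate.
f_t = [1.0, 0.0]
i_t = [0.0, 1.0]

c_t = update_cell_state(f_t, c_prev, i_t, c_tilde)
print(c_t)  # [2.0, 0.5]

# Repeating the "keep" setting leaves the memory untouched over many steps.
c = [2.0]
for _ in range(100):
    c = update_cell_state([1.0], c, [0.0], [0.9])
print(c)  # [2.0]
```

The second part illustrates the long-term memory path: with the forget gate open and the input gate closed, a value written once is carried forward unchanged for as many steps as needed.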

### 4. The Output Gate ($O_t$)

<img src="https://i.ibb.co/pr0ZGWQf/image.png">

The Output Gate determines the **value of the Current Hidden State** ($H_t$) for the current time step. It filters the newly updated Cell State ($C_t$).

**Stage 1: Tuning the Cell State**

*   The updated Cell State ($C_t$) is passed through a point-wise **Tanh** function (no weights are involved here), which rescales each value into the range -1 to 1.

**Stage 2: Output Filter ($O_t$)**

*   This filter decides which parts of the (tuned) Cell State should be output as the Hidden State.
*   **Mechanism:** It uses a Neural Network layer with a **Sigmoid** activation function.
*   **Equation:**
    $$O_t = \sigma(W_o \cdot [H_{t-1}, X_t] + B_o)$$

**Stage 3: Hidden State Calculation**

*   The final Hidden State ($H_t$) is calculated by **point-wise multiplication** of the Output Filter ($O_t$) and the tuned Cell State ($\tanh(C_t)$).
*   **Final Hidden State Equation:**
    $$H_t = O_t \odot \tanh(C_t)$$
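Putting the equations of this section together, one full time step might be sketched as below. This is a dependency-free illustration under stated assumptions, not how a library implements it: the parameter dictionary, the `init_params` helper, the random initial weights, and the toy sequence are all invented for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(W, v, b, activation):
    """One layer: activation(W . v + b), with W stored as a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. Returns (h_t, c_t)."""
    z = h_prev + x_t                                              # [H_{t-1}, X_t]
    f_t = dense(params["W_f"], z, params["b_f"], sigmoid)         # forget gate
    i_t = dense(params["W_i"], z, params["b_i"], sigmoid)         # input filter
    c_tilde = dense(params["W_c"], z, params["b_c"], math.tanh)   # candidate cell state
    o_t = dense(params["W_o"], z, params["b_o"], sigmoid)         # output filter
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(units, input_dim, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = random.Random(seed)
    params = {}
    for gate in "fico":
        params["W_" + gate] = [[rng.uniform(-0.5, 0.5) for _ in range(units + input_dim)]
                               for _ in range(units)]
        params["b_" + gate] = [0.0] * units
    return params


units, input_dim = 3, 2
params = init_params(units, input_dim)
h, c = [0.0] * units, [0.0] * units           # initial states
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)       # both states are carried to the next step
print(h, c)
```

The loop makes the data flow explicit: each step consumes $X_t$, $H_{t-1}$, and $C_{t-1}$, and hands $H_t$ and $C_t$ to the next step.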

<img src="https://miro.medium.com/v2/1*goJVQs-p9kgLODFNyhl9zA.gif">

<video><source src="https://packaged-media.redd.it/afzlbpt2ncg81/pb/m2-res_1080p.mp4?m=DASHPlaylist.mpd&v=1&e=1760396400&s=7b04600131aa454e5c0b5854952a5df6ff271ee7" type="video/mp4"></video>

# Understanding LSTM Outputs and How to Combine Them (Keras)

## 🧠 1️⃣ Available LSTM Output Options in Keras

An LSTM layer can output:

- Full sequence of hidden states (`return_sequences=True`)
- Final hidden & cell states (`return_state=True`)
- Or both at once.

### 🔹 OPTION A: Default (single output)

``` python
LSTM(units)
```

**Returns:** only the final hidden state → `h_t` (shape: `(batch, units)`)

✅ Used for:

- Classification tasks
- Sequence summary (e.g., sentiment analysis)

------------------------------------------------------------------------

### 🔹 OPTION B: Return sequence of outputs

``` python
LSTM(units, return_sequences=True)
```

**Returns:** sequence of all hidden states → `(batch, timesteps, units)`

✅ Used for:

- Many-to-many models (e.g., sequence tagging, translation)
- Feeding into another LSTM (stacked LSTMs)

------------------------------------------------------------------------

### 🔹 OPTION C: Return final states

``` python
LSTM(units, return_state=True)
```

**Returns:**

- `output` → final hidden state (`h_t`)
- `h` → final hidden state (identical to `output` here, because `return_sequences` is left at its default of `False`)
- `c` → final cell state

✅ Used for:

- Encoder-Decoder models (Seq2Seq)
- Combining long-term & short-term memory manually

------------------------------------------------------------------------

### 🔹 OPTION D: Both sequences and states

``` python
LSTM(units, return_sequences=True, return_state=True)
```

**Returns:**

- `outputs` → all hidden states
- `h` → final hidden state
- `c` → final cell state

✅ Used for:

- Custom RNN architectures
- Attention mechanisms where both full sequence and final state are needed

------------------------------------------------------------------------
