# The Vanishing Gradient Problem

In standard RNNs, gradients are propagated backward through time. The gradient of the loss with respect to an earlier hidden state is obtained by repeatedly multiplying by the Jacobian of each recurrent step. When the norms of these Jacobians (their largest singular values) are below 1, the gradient shrinks exponentially as it propagates back, leading to the **vanishing gradient problem**.

### Simplified Gradient Expression for RNN:
$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^T \frac{\partial h_t}{\partial h_{t-1}}$$

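If $\|\frac{\partial h_t}{\partial h_{t-1}}\| < 1$ at every step, the product shrinks exponentially with the number of time steps $T$, so the contribution of early time steps to the weight gradient vanishes.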
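The shrinking product can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical input-free recurrence $h_t = \tanh(W h_{t-1})$, a recurrent matrix `W` scaled so that all of its singular values equal 0.5, and a vector of ones standing in for $\frac{\partial L}{\partial h_T}$. The function name `backprop_norms` is introduced here for the example.

```python
import numpy as np

def backprop_norms(W, steps, seed=0):
    """Return the gradient norm after each backward step through
    the toy recurrence h_t = tanh(W @ h_{t-1}) (no inputs, an assumption)."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]

    # Forward pass: store every hidden state for the backward pass.
    h = rng.standard_normal(n)
    hidden_states = []
    for _ in range(steps):
        h = np.tanh(W @ h)
        hidden_states.append(h)

    # Backward pass: start from a stand-in for dL/dh_T.
    grad = np.ones(n)
    norms = []
    for h_t in reversed(hidden_states):
        # Jacobian dh_t/dh_{t-1} = diag(1 - h_t^2) @ W;
        # backpropagation multiplies by its transpose.
        grad = W.T @ ((1.0 - h_t**2) * grad)
        norms.append(np.linalg.norm(grad))
    return norms

# Build a matrix whose singular values are all 0.5 (orthogonal matrix times 0.5).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
W = 0.5 * Q

norms = backprop_norms(W, steps=30)
print("after 1 step:  ", norms[0])
print("after 30 steps:", norms[-1])
```

Because every Jacobian here has norm at most 0.5, the gradient norm at least halves with each step back in time. Scaling `W` so that its singular values exceed 1 does not guarantee growth, since the $\tanh$ derivative can still shrink the product.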

## How LSTMs Mitigate Vanishing Gradients:

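1. **Cell State ($C_t$):** LSTMs introduce a cell state that is updated additively: the previous state is scaled element-wise by the forget gate and new candidate information is added, rather than passing the whole state through a weight matrix and a squashing nonlinearity at every step. When the forget gate stays close to 1, the error flow along the cell state remains nearly constant, preserving information over many time steps.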
2. **Gated Mechanisms:** The gates control the flow of information, letting the model learn when to update, keep, or forget the contents of the cell state, so gradients along the cell state are far less prone to uncontrolled shrinkage.
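
To make the additive-update argument concrete, here is a brief sketch of the gradient along the cell state, based on the cell state update given under "LSTM Cell Equations". As a simplifying assumption, the gates and the candidate $\tilde{C}_t$ are treated as constants with respect to $C_{t-1}$, ignoring their indirect dependence through $h_{t-1}$:

$$\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t), \qquad \frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)$$

Under this approximation each factor is a diagonal matrix with entries in $(0, 1)$, with no repeated recurrent weight matrix and no $\tanh$ derivative. When the forget gate stays near 1 the product stays close to the identity, so the gradient along this path decays slowly. When the forget gate is small, the gradient still vanishes, which corresponds to the model deliberately forgetting.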



# Long Short-Term Memory (LSTM) Cell

LSTM is a type of recurrent neural network (RNN) designed to learn long-term dependencies, which standard RNNs struggle with due to the vanishing gradient problem. LSTMs achieve this by combining a cell state with three gates:

1. **Forget Gate** ($f_t$): Decides what information to discard from the cell state.
2. **Input Gate** ($i_t$): Determines what information to update in the cell state.
3. **Output Gate** ($o_t$): Controls what part of the cell state is exposed as the hidden state $h_t$.

## LSTM Cell Equations:

### Forget Gate:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

### Input Gate:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

### Cell State Update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

### Output Gate:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

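Here, $\sigma$ represents the sigmoid activation function, and $\tanh$ represents the hyperbolic tangent activation function; both are applied element-wise. $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input, and the products between a gate and a state vector in the cell state update and in the expression for $h_t$ are element-wise (Hadamard) products, while $W \cdot [h_{t-1}, x_t]$ is a matrix-vector product.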
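
The equations above can be sketched as a single forward step in NumPy. This is a minimal illustration, not a reference implementation: the helper names `sigmoid`, `init_params`, and `lstm_step`, the dictionary layout of the parameters, and the small random initialization are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    shape = (hidden_size, hidden_size + input_size)
    params = {}
    for name in ("f", "i", "C", "o"):
        params[f"W_{name}"] = 0.1 * rng.standard_normal(shape)
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM forward step following the equations in this section."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                     # element-wise update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                               # new hidden state
    return h_t, c_t

# Run the cell over a short random sequence (sizes chosen arbitrarily).
input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
sequence = np.random.default_rng(42).standard_normal((5, input_size))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print("final hidden state:", h)
```

Since $h_t$ is a sigmoid output multiplied by a $\tanh$ output, every entry of the hidden state lies strictly between -1 and 1, whereas the cell state `c` is not bounded in this way.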
