# The Vanishing Gradient Problem

In standard RNNs, gradients are propagated backward through time. The gradient of the loss with respect to earlier time steps is computed by repeatedly multiplying by the Jacobian of each hidden-state transition. When the norms of these Jacobians (their largest singular values) are below 1, the gradient shrinks exponentially as it propagates back, leading to the **vanishing gradient problem**.

### Simplified Gradient Expression for RNN:
$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^T \frac{\partial h_t}{\partial h_{t-1}}$$

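If $\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| < 1$ at every step, the product shrinks exponentially as the number of time steps $T$ grows, so the earliest inputs receive almost no gradient signal.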
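The shrinking product can be illustrated with a scalar toy recurrence $h_t = \tanh(w \, h_{t-1})$, whose per-step derivative is $w (1 - h_t^2)$. This is only a sketch: the function name `gradient_through_time`, the weight `0.9`, and the starting state `0.5` are assumptions chosen for illustration, not values from a real model.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_steps / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: multiply by the derivative of this step, w * (1 - tanh^2).
        grad *= w * (1.0 - h * h)
    return grad

# Hypothetical weight and initial state, picked only to show the trend.
grads = {steps: gradient_through_time(0.9, 0.5, steps) for steps in (1, 10, 50)}
for steps, grad in grads.items():
    print(f"{steps:>2} steps: gradient = {grad:.3e}")
```

Because every factor is at most `0.9` in magnitude here, the gradient after 50 steps is bounded by `0.9 ** 50`, which is already below `0.01`.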

## How LSTMs Mitigate Vanishing Gradients:

1. **Cell State ($C_t$):** LSTMs introduce a cell state that is updated additively rather than through repeated matrix multiplication and squashing. This gives the error signal a path along which it can flow nearly unchanged, preserving information over many time steps.
2. **Gating Mechanisms:** The gates control the flow of information, letting the model selectively update and forget information. When the forget gate stays close to 1, gradients along the cell state pass through largely unchanged instead of shrinking at every step.
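As a rough sketch of why this helps, consider the gradient along the cell state path alone. Treating the gates as constants (a simplifying assumption that ignores their dependence on $h_{t-1}$), and writing $f_t$ for the forget gate defined in the cell equations later in this section, the per-step Jacobian is approximately diagonal:

$$\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t), \qquad \frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)$$

Each factor is a learned value in $(0, 1)$ rather than a fixed weight matrix passed through a squashing nonlinearity, so the network can keep the product close to 1 for the components it needs to remember.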



# Long Short-Term Memory (LSTM) Cell

LSTM is a type of recurrent neural network (RNN) designed to capture long-term dependencies, which standard RNNs struggle with because of the vanishing gradient problem. LSTMs achieve this with an architecture built around three main gates:

1. **Forget Gate** ($f_t$): Decides what information to discard from the cell state.
2. **Input Gate** ($i_t$): Determines what information to update in the cell state.
3. **Output Gate** ($o_t$): Controls what information is sent to the output.

## LSTM Cell Equations:

### Forget Gate:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

### Input Gate:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

### Cell State Update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

### Output Gate:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Here, $\sigma$ is the sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input, and $\odot$ denotes element-wise multiplication.
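One way to sanity-check the equations above is to evaluate a single step by hand and compare it with `torch.nn.LSTMCell`. The sketch below assumes small toy dimensions and reuses the cell's own parameters. PyTorch stores separate input and hidden weight matrices (`weight_ih`, `weight_hh`) with two bias vectors, stacked in the gate order input, forget, candidate, output, which is mathematically equivalent to one matrix applied to the concatenated vector.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim, batch = 4, 3, 2  # toy sizes, chosen for illustration

cell = torch.nn.LSTMCell(input_dim, hidden_dim)
x = torch.rand(batch, input_dim)
h_prev = torch.rand(batch, hidden_dim)
c_prev = torch.rand(batch, hidden_dim)

with torch.no_grad():
    # Pre-activations for all four gates at once: shape (batch, 4 * hidden_dim).
    gates = x @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    # PyTorch's stacking order is: input gate, forget gate, candidate, output gate.
    i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)
    i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
    c_tilde = torch.tanh(g_t)

    c_manual = f_t * c_prev + i_t * c_tilde   # cell state update
    h_manual = o_t * torch.tanh(c_manual)     # hidden state

    h_ref, c_ref = cell(x, (h_prev, c_prev))  # reference result from PyTorch

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```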


In [2]:
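import torch

# Toy dimensions: batch size, sequence length, embedding size, hidden size.
bs, seq_len, embd_size, hidden_size = 2, 10, 300, 256

xt = torch.rand((bs, seq_len, embd_size))  # batch of input sequences
h0 = torch.rand((bs, hidden_size))         # initial hidden state
c0 = torch.rand((bs, hidden_size))         # initial cell state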



In [20]:
class MyLSTM(torch.nn.Module):
    """Single-layer LSTM unrolled over time, following the cell equations above."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()

        # One linear layer per gate; each acts on the concatenation [h_{t-1}, x_t].
        self.f = torch.nn.Linear(input_dim + hidden_dim, hidden_dim)   # forget gate
        self.i = torch.nn.Linear(input_dim + hidden_dim, hidden_dim)   # input gate
        self.c_ = torch.nn.Linear(input_dim + hidden_dim, hidden_dim)  # candidate cell state
        self.o = torch.nn.Linear(input_dim + hidden_dim, hidden_dim)   # output gate

    def forward(self, x, h, c):
        hidden_states = []
        bs, seq_len, embd_size = x.shape

        for t in range(seq_len):
            h_x = torch.cat([h, x[:, t]], dim=1)
            f_t = torch.sigmoid(self.f(h_x))
            i_t = torch.sigmoid(self.i(h_x))
            o_t = torch.sigmoid(self.o(h_x))
            c_tilde = torch.tanh(self.c_(h_x))

            c = f_t * c + i_t * c_tilde
            h = o_t * torch.tanh(c)
            hidden_states.append(h)  # collect the hidden state, not the cell state

        # Stack to (seq_len, bs, hidden_dim), then move the batch dimension first.
        hidden_states = torch.stack(hidden_states).permute(1, 0, 2)

        return hidden_states, (h, c)

In [21]:
lstm=MyLSTM(embd_size, hidden_size)
hidden_states, (hn,cn)=lstm(xt,h0,c0)

In [22]:
hidden_states.shape

torch.Size([2, 10, 256])

In [24]:
hn.shape, cn.shape

(torch.Size([2, 256]), torch.Size([2, 256]))

In [26]:
torch.cat((hn,cn),1).shape

torch.Size([2, 512])

## PyTorch LSTM Layer
Creating an LSTM in PyTorch involves understanding both the built-in `nn.LSTM` module and how to use it for sequence modeling tasks. Here's a step-by-step guide:

---

### **1. Understand the `nn.LSTM` Module**
The `nn.LSTM` module in PyTorch implements a multi-layer LSTM. It automatically handles the forward propagation of LSTM cells for entire sequences.

Key parameters:
- **`input_size`**: The number of features in the input.
- **`hidden_size`**: The number of features in the hidden state.
- **`num_layers`**: Number of stacked LSTM layers.
- **`batch_first`**: If `True`, input/output tensors have shape `(batch, seq, feature)` instead of `(seq, batch, feature)`.
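The effect of these parameters on tensor shapes can be seen in a small sketch. The sizes below are toy values assumed for illustration. Note that `batch_first` only changes the layout of the input and output sequences; the final hidden and cell states keep the shape `(num_layers, batch, hidden_size)` either way.

```python
import torch

batch, seq_len, input_size, hidden_size, num_layers = 2, 5, 8, 4, 2  # toy sizes
x = torch.rand(batch, seq_len, input_size)

# batch_first=True: sequences are laid out as (batch, seq, feature).
lstm_bf = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
out_bf, (hn_bf, cn_bf) = lstm_bf(x)
print(out_bf.shape)  # torch.Size([2, 5, 4])
print(hn_bf.shape)   # torch.Size([2, 2, 4]) -> (num_layers, batch, hidden_size)

# Default batch_first=False: sequences are laid out as (seq, batch, feature).
lstm_sf = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers)
out_sf, (hn_sf, cn_sf) = lstm_sf(x.transpose(0, 1))
print(out_sf.shape)  # torch.Size([5, 2, 4])
print(hn_sf.shape)   # torch.Size([2, 2, 4])
```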

---

### **2. Define the Model**
An LSTM model in PyTorch typically consists of:
- An embedding layer (optional for NLP tasks).
- An LSTM layer (`nn.LSTM`).
- A fully connected layer to map the LSTM output to the desired prediction size.
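The three components listed above could be combined roughly as follows. This is a minimal sketch, not a reference design: the class name `SequenceClassifier`, the vocabulary size, and the number of classes are hypothetical, and using the last layer's final hidden state as the sequence summary is one common choice among several.

```python
import torch
from torch import nn

class SequenceClassifier(nn.Module):
    """Hypothetical example model: embedding -> LSTM -> linear classifier."""

    def __init__(self, vocab_size, embd_size, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embd_size)
        self.lstm = nn.LSTM(embd_size, hidden_size, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, embd_size)
        _, (hn, _) = self.lstm(embedded)       # hn: (num_layers, batch, hidden_size)
        return self.fc(hn[-1])                 # logits: (batch, num_classes)

# Toy sizes and random token ids, assumed for illustration only.
model = SequenceClassifier(vocab_size=1000, embd_size=32, hidden_size=16, num_classes=3)
token_ids = torch.randint(0, 1000, (2, 10))
logits = model(token_ids)
print(logits.shape)  # torch.Size([2, 3])
```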



In [29]:
lstm = torch.nn.LSTM(embd_size, hidden_size, num_layers=1, batch_first=True)

In [31]:
hidden_states, (hn,cn)=lstm(xt)

In [32]:
hidden_states.shape, hn.shape, cn.shape

(torch.Size([2, 10, 256]), torch.Size([1, 2, 256]), torch.Size([1, 2, 256]))
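When no initial states are passed, `nn.LSTM` starts from zeros. To supply explicit initial states, they need a leading `num_layers` dimension, which accounts for the extra `1` in the shapes of `hn` and `cn` above. The sketch below uses smaller toy dimensions than the cells above (an assumption made to keep it quick) and also checks that, for a single-layer unidirectional LSTM, the last output step equals the final hidden state.

```python
import torch

torch.manual_seed(0)
bs, seq_len, embd_size, hidden_size = 2, 10, 8, 4  # toy sizes

xt = torch.rand(bs, seq_len, embd_size)
h0 = torch.rand(bs, hidden_size)
c0 = torch.rand(bs, hidden_size)

lstm = torch.nn.LSTM(embd_size, hidden_size, num_layers=1, batch_first=True)

# Initial states must have shape (num_layers, batch, hidden_size).
hidden_states, (hn, cn) = lstm(xt, (h0.unsqueeze(0), c0.unsqueeze(0)))

# With one unidirectional layer, the last time step of the output is hn[0].
last_matches = torch.allclose(hidden_states[:, -1], hn[0])
print(hidden_states.shape, hn.shape, cn.shape)
print(last_matches)
```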