# LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units)

# 1. LSTMs (Long Short-Term Memory) Theory:
- Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.
- This is a behavior required in complex problem domains like machine translation, speech recognition, and more. LSTMs are a complex area of deep learning.
- <img src="https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/33_blog_image_1.png" width="600" height="350">

## Why LSTM over Simple RNN?
 ### Problems with RNN:
 - As we know, a parameter called the **HIDDEN STATE** is maintained throughout the layers and hidden cells in order to keep the **Sequential Information** preserved.
 - As we proceed further, the **info of the first hidden cell**, i.e., the **previous hidden state**, gets **diminished** as shown below.
 - <img src="https://miro.medium.com/max/1000/1*d_POV7c8fzHbKuTgJzCxtA.gif" width="400" height="250">
 - By the *5th cell, the info of the 1st cell has become so small that, as we proceed further, it effectively becomes **Zero**.*
 - After this **Forward Propagation** step, when we **backpropagate** to update the weights, the gradients flowing back to the early time steps become so small that the weights barely update.
 - This is the **Vanishing Gradient** problem: the updates driven by early time steps are almost zero, so the network fails to learn long-range dependencies and the loss stops improving.
 - **A simple RNN also has no mechanism to choose what to keep: it tries to carry info about all the words, which makes it learn useless patterns as well.**
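
The shrinking gradient described above can be seen in a few lines of plain Python. This is only a minimal sketch under simplifying assumptions: a scalar RNN `h_t = tanh(w * h_{t-1} + x_t)` with a hypothetical weight `w = 0.5` and constant inputs, not the network trained later in this notebook. The function name `gradient_through_time` is made up for the illustration.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| for every step t of a scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    grads = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule for one step: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        grads.append(abs(grad))
    return grads

# Illustrative values only: w = 0.5 and 20 identical inputs of 0.1.
grads = gradient_through_time(w=0.5, inputs=[0.1] * 20)
for step in (1, 5, 10, 20):
    print(f"step {step:2d}: |dh_t/dh_0| = {grads[step - 1]:.2e}")
```

Every step multiplies the gradient by a factor of at most `w`, so with `w < 1` the contribution of the first cell decays geometrically with the sequence length.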

## How does LSTM solve these problems?
### Maths behind LSTMs:
- Some more terminology is added on top of the previous RNN architecture.
- Let us look at the RNN architecture again to recap what we have covered so far.
- <img src="https://miro.medium.com/max/1400/1*WMnFSJHzOloFlJHU6fVN-g.gif" width="600" height="250">
- Here there is the **Hidden state**, a parameter that carries the info from the previous cell forward.
- But this info is always dominated by the most recent cells rather than all the previous cells.
- This is closely tied to the **Vanishing Gradient** problem. To overcome it, an extra parameter called the **Cell State** is added to the RNN and carried along with the **Hidden State**.
- Together they keep the irrelevant info away, which gives the **LSTM** a different architecture. Let us understand more about it from the diagram below.
- <img src="https://miro.medium.com/max/388/1*hG4zBCCRq18oi8aarj-owA.png" width="600" height="450">
- This diagram indicates the different portions of the cell.
- We will use illustrated GIFs to understand the flow much better.
- **In Forget Gate:**
  - <img src="https://miro.medium.com/max/1900/1*GjehOa513_BgpDDP6Vkw2Q.gif" width="600" height="350">
  - The **previous hidden state ($h_{t-1}$)** is passed into the cell along with the current **input ($x_{t}$)**.
  - The two are combined (concatenated as **$[h_{t-1}, x_{t}]$**, multiplied by the gate's weights **$W_{f}$** and shifted by a bias **$b_{f}$**), and the result is passed through the **Sigmoid Activation** before being sent into the **Cell state region**.
  - Output of this gate is **$f_{t}=\sigma(W_{f}[h_{t-1}, x_{t}] + b_{f})$**, with every entry between 0 (forget) and 1 (keep).

- **In Input Gate:**
  - <img src="https://miro.medium.com/max/1900/0*KhHCSln2LmiFGH8r.gif" width="600" height="350">
  - Here the same combination **$[h_{t-1}, x_{t}]$** is used as input to two activation functions, **Sigmoid** and **tanh**, each with its own weights.
  - The **sigmoid** branch gives the input gate **$i_{t}=\sigma(W_{i}[h_{t-1}, x_{t}] + b_{i})$**, and the **tanh** branch gives the candidate values **$\tilde{c}_{t}=\tanh(W_{c}[h_{t-1}, x_{t}] + b_{c})$**.
  - Now an **element-wise product** (not a cross product) is taken between **$i_{t}$** and **$\tilde{c}_{t}$**.
  - This product, **$i_{t} \odot \tilde{c}_{t}$**, is sent into the **Cell State**.

- **In Cell State:**
  - <img src="https://miro.medium.com/max/1900/1*cmv5EOAd6iWMzWvHrZbl-w.gif" width="600" height="350">
  - Firstly, the output of the **Forget Gate ($f_{t}$)** is multiplied **element-wise** with the **previous cell state ($c_{t-1}$).**
  - This product is a new vector, which is then added to the output of the Input Gate **($i_{t} \odot \tilde{c}_{t}$)**.
  - This is the **New Cell State**, i.e., **$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t}$**.

- **Finally, the Output Gate:**
  - <img src="https://miro.medium.com/max/1032/1*AUwc53cmW04hjPrKtVyePQ.gif" width="600" height="350">
  - The same combination **$[h_{t-1}, x_{t}]$** is now sent as input to the **Output Gate**, where the **Sigmoid Activation** gives **$o_{t}=\sigma(W_{o}[h_{t-1}, x_{t}] + b_{o})$**.
  - The **new cell state** is fed into this gate, and the **tanh** activation is applied to it, i.e., **$\tanh(c_{t})$**.
  - An **element-wise product** is again taken between both results, i.e., **$h_{t} = o_{t} \odot \tanh(c_{t})$**.
  - This is used as the **New Hidden State**.

- This is the functionality of a whole **LSTM** cell.
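
The four gate steps described above can be written out by hand and compared against PyTorch's built-in `nn.LSTMCell`. This is a minimal sketch, not the model used later: the sizes (`hidden_size = 16`, a batch of 4) and the helper name `manual_lstm_step` are assumptions chosen for the illustration. It relies on `nn.LSTMCell` storing its weights as four stacked blocks in the order input, forget, candidate, output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 28, 16, 4   # illustrative sizes only
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)         # current input
h_prev = torch.randn(batch, hidden_size)     # previous hidden state
c_prev = torch.randn(batch, hidden_size)     # previous cell state

def manual_lstm_step(cell, x_t, h_prev, c_prev):
    # One affine map produces the pre-activations of all four gates at once.
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stacks the blocks in the order: input, forget, candidate, output.
    i, f, g, o = gates.chunk(4, dim=1)
    i_t = torch.sigmoid(i)                   # input gate
    f_t = torch.sigmoid(f)                   # forget gate
    c_tilde = torch.tanh(g)                  # candidate values
    o_t = torch.sigmoid(o)                   # output gate
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state (element-wise products)
    h_t = o_t * torch.tanh(c_t)              # new hidden state
    return h_t, c_t

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(cell, x_t, h_prev, c_prev)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5),
      torch.allclose(c_manual, c_ref, atol=1e-5))
```

Note that every product between gate outputs and states is element-wise (`*`), which is why each gate value acts as a per-feature "how much to let through" dial.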

## How did LSTM solve the problems in RNN?
- Since the cell state is maintained all along the network, the sigmoid gates can pass or suppress information at every step, depending on their outputs (values near 1 keep it, values near 0 discard it).
- This keeps the irrelevant info of the hidden state away from the cell state (which is regarded as the main sequential info).
- Because the cell state is updated mostly by addition rather than by repeated squashing, gradients can flow across many time steps, which greatly reduces the **Vanishing Gradient** problem as well.
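
A rough way to see why the cell state helps: along the direct cell-state path, the gradient of the last cell state with respect to the first is just the product of the forget-gate values in between. The sketch below is a simplification that ignores the indirect dependence of the gates on earlier states, and the gate values (0.99 and 0.5) are made-up numbers, not values measured from a trained model.

```python
def cell_state_gradient(forget_gates):
    """Product of forget-gate values: the direct path d c_T / d c_0 (scalar case)."""
    grad = 1.0
    for f_t in forget_gates:
        grad *= f_t
    return grad

steps = 50
mostly_open = cell_state_gradient([0.99] * steps)  # gates close to 1: information is kept
half_open = cell_state_gradient([0.5] * steps)     # gates at 0.5: information fades quickly
print(f"gates at 0.99: {mostly_open:.3f}   gates at 0.5: {half_open:.2e}")
```

When the network learns to keep its forget gates close to 1 for information that matters, the gradient survives over dozens of steps instead of collapsing toward zero.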


# 2. GRUs (Gated Recurrent Unit) Theory:
- These are another alternative that addresses the problems faced by RNNs.
- Researchers often try both LSTMs and GRUs and choose between them based on performance. Generally, LSTMs are the default choice.
## Understanding GRUs and their Maths
- <img src="https://miro.medium.com/max/2084/1*jhi5uOm9PvZfmxvfaCektw.png" width="600" height="450">
- GRU supports gating and a hidden state to control the flow of information. 
- To solve the problem that comes up in RNN, GRU uses two gates: the update gate and the reset gate.
- <img src="https://miro.medium.com/max/3838/1*v7Ax40Y01a5IdFsb69crGg.jpeg" width="600" height="350">
- **Update Gate:**
  - Compared with LSTMs, the **Update Gate plays the combined role of the Forget and Input Gates.**
  - The update gate **$z_{t}$** determines how much of the previous information (from prior time steps) is passed along to the next state.
  - It is computed as **$z_{t}=\sigma(W_{z}[h_{t-1}, x_{t}] + b_{z})$**, as shown in the figure above, and its output is then used when the new hidden state is formed.

- **Reset Gate:**
  - The reset gate **$r_{t}$** is used by the model to decide how much of the past information to neglect.
  - It follows the same formula with its own weights: **$r_{t}=\sigma(W_{r}[h_{t-1}, x_{t}] + b_{r})$**.

- Let us move forward with the implementations.
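
Before the full models, the GRU step described above can be sketched by hand and checked against PyTorch's `nn.GRUCell`. As with any such sketch, the sizes and the helper name `manual_gru_step` are assumptions for illustration. One convention to be aware of: in PyTorch, an update gate value near 1 keeps the old hidden state, while some texts and diagrams swap the roles of `z_t` and `1 - z_t`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 28, 16, 4   # illustrative sizes only
cell = nn.GRUCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)         # current input
h_prev = torch.randn(batch, hidden_size)     # previous hidden state

def manual_gru_step(cell, x_t, h_prev):
    # Input and hidden contributions, each stacked as (reset, update, candidate).
    gi = x_t @ cell.weight_ih.T + cell.bias_ih
    gh = h_prev @ cell.weight_hh.T + cell.bias_hh
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)
    r_t = torch.sigmoid(i_r + h_r)           # reset gate: how much past info to neglect
    z_t = torch.sigmoid(i_z + h_z)           # update gate: how much past info to carry on
    n_t = torch.tanh(i_n + r_t * h_n)        # candidate state, with the past scaled by r_t
    return (1 - z_t) * n_t + z_t * h_prev    # blend of candidate and old hidden state

with torch.no_grad():
    h_manual = manual_gru_step(cell, x_t, h_prev)
    h_ref = cell(x_t, h_prev)

print(torch.allclose(h_manual, h_ref, atol=1e-5))
```

Unlike the LSTM, there is no separate cell state here: the hidden state is the only thing carried from step to step.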

**NOTE:** For Further info please go through this awesome blog:
[RNN by Stanford](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks), and for video watch this simple illustration [RNN Illustration Video](https://www.youtube.com/watch?v=LHXXI4-IEns&t=496s&ab_channel=TheA.I.Hacker-MichaelPhiTheA.I.Hacker-MichaelPhi)

# Importing Packages

In [2]:
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Let us build LSTM and GRU networks.

## Setting up **DATASET** and **HYPERPARAMETERS**.

In [3]:
input_size = 28 # Each 28x28 image is read row by row, so one time step is one row of 28 pixels.
sequence_length = 28 # Number of time steps per image (28 rows).
num_of_layers = 2 # Number of stacked recurrent layers.
hidden_size = 256 # Number of features in the hidden state.
num_classes = 10 # MNIST has 10 digit classes.
learning_rate = 0.001 # Step size used by the optimizer.
batch_size = 64 # Number of samples processed in one training step.
epochs = 2 # Number of full passes over the training set.

# Firstly Loading a Data and downloading it to the folder.
train_dataset = datasets.MNIST(root="content/",train=True,
                               transform=transforms.ToTensor(),download=True)
# Now setting up its properties like batchsize
trainloader = DataLoader(dataset=train_dataset,
                         batch_size = batch_size,
                         shuffle=True
                         )

# Doing the same for the test set (no shuffling is needed for evaluation)
test_dataset = datasets.MNIST(root="content/",train=False,
                               transform=transforms.ToTensor(),download=True)

testloader = DataLoader(dataset=test_dataset,
                         batch_size = batch_size,
                         shuffle=False
                         )
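
To make the `input_size` and `sequence_length` settings concrete, the sketch below shows how one batch of images becomes a batch of sequences. It uses a random tensor with the same shape as an MNIST batch (an assumption, so that nothing needs to be downloaded); the variable names are illustrative.

```python
import torch

# Stand-in for one MNIST batch: (batch, channels, height, width).
fake_batch = torch.rand(64, 1, 28, 28)

# Dropping the channel axis leaves (batch, sequence_length, input_size):
# every image is treated as 28 time steps, each a row of 28 pixel values.
sequences = fake_batch.squeeze(1)
print(sequences.shape)

# The first time step of every sequence is the top row of every image.
first_step = sequences[:, 0, :]
print(first_step.shape)
```

This is the layout expected by recurrent layers created with `batch_first=True`.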



## Creating LSTM and GRU Networks

In [5]:
# LSTM network.

class LSTM(nn.Module):
  def __init__(self, input_size, hidden_size, num_of_layers, num_of_classes):
    super(LSTM, self).__init__()
    self.hidden_size = hidden_size
    self.num_of_layers = num_of_layers
    self.num_of_classes = num_of_classes
    self.lstm = nn.LSTM(input_size, hidden_size, num_of_layers, batch_first=True)
    # The outputs of all time steps are flattened, hence hidden_size*sequence_length
    # (sequence_length is the global hyperparameter defined above).
    self.fc = nn.Linear(hidden_size*sequence_length, num_of_classes)

  def forward(self, x):
    # Initial hidden and cell states: zeros of shape (num_layers, batch, hidden_size), on the same device as x.
    h0 = torch.zeros(self.num_of_layers, x.size(0), self.hidden_size, device=x.device)
    c0 = torch.zeros(self.num_of_layers, x.size(0), self.hidden_size, device=x.device)

    # Forward step: nn.LSTM takes the input and a tuple (h0, c0); it returns the outputs
    # of every time step together with the final (hidden state, cell state) tuple.
    lstm_out_vector,_ = self.lstm(x, (h0, c0))
    lstm_out_vector = lstm_out_vector.reshape(lstm_out_vector.shape[0],-1) # (batch, sequence_length*hidden_size)
    output_from_linear_layer = self.fc(lstm_out_vector) # Sending the LSTM output into the Linear layer
    return output_from_linear_layer


# GRU network
class GRU(nn.Module):
  def __init__(self, input_size, hidden_size, num_of_layers, num_of_classes):
    super(GRU, self).__init__()
    self.hidden_size = hidden_size
    self.num_of_layers = num_of_layers
    self.num_of_classes = num_of_classes
    self.gru = nn.GRU(input_size, hidden_size, num_of_layers, batch_first=True)
    # As in the LSTM above, the outputs of all time steps are flattened before the Linear layer.
    self.fc = nn.Linear(hidden_size*sequence_length, num_of_classes)

  def forward(self, x):
    # Initial hidden state: zeros of shape (num_layers, batch, hidden_size), on the same device as x.
    h0 = torch.zeros(self.num_of_layers, x.size(0), self.hidden_size, device=x.device)

    # Forward step: nn.GRU takes the input and a hidden state (there is no cell state);
    # it returns the outputs of every time step together with the final hidden state.
    gru_out_vector,_ = self.gru(x, h0)
    gru_out_vector = gru_out_vector.reshape(gru_out_vector.shape[0],-1) # (batch, sequence_length*hidden_size)
    output_from_linear_layer = self.fc(gru_out_vector) # Sending the GRU output into the Linear layer
    return output_from_linear_layer
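
The shapes flowing through the recurrent layer explain why the linear layer takes `hidden_size*sequence_length` inputs. The sketch below traces them for a random batch, using the same hyperparameter values as above written out as literals (the random input and the variable names are assumptions for illustration).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=28, hidden_size=256, num_layers=2, batch_first=True)
x = torch.rand(64, 28, 28)            # (batch, sequence_length, input_size)

with torch.no_grad():
    # When (h0, c0) is omitted, PyTorch uses zeros, matching the explicit zeros in the classes above.
    out, (h_n, c_n) = lstm(x)

print(out.shape)                      # top-layer output at every time step
print(h_n.shape)                      # final hidden state of every layer

flat = out.reshape(out.shape[0], -1)  # what the classes above feed to nn.Linear
last_step = out[:, -1, :]             # alternative: keep only the final time step
print(flat.shape, last_step.shape)
```

Flattening all 28 steps gives the classifier 28 × 256 = 7168 features per image. A common alternative, not used in this notebook, is to pass only `last_step` (which equals the top layer's final hidden state) to a much smaller linear layer.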


# Initializing Model, Loss, Optimizer

In [6]:
# Initializing the LSTM Model
lstm_model = LSTM(input_size, hidden_size, num_of_layers, num_classes)

# Initializing the GRU Model
gru_model = GRU(input_size, hidden_size, num_of_layers, num_classes)
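
One practical difference between the two recurrent layers is their size. The sketch below counts the trainable parameters of the recurrent part only (not the final linear layer), using the same hyperparameter values as above; `count_parameters` is a small helper written for this illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

lstm_params = count_parameters(nn.LSTM(28, 256, 2, batch_first=True))
gru_params = count_parameters(nn.GRU(28, 256, 2, batch_first=True))
print(f"LSTM: {lstm_params}  GRU: {gru_params}  ratio: {gru_params / lstm_params:.2f}")
```

The LSTM learns four weight blocks per layer (input, forget, candidate, output) while the GRU learns three (reset, update, candidate), so for the same sizes the GRU layer has three quarters of the parameters.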

In [8]:

# Using CrossEntropyLoss as we have multiple classes.
loss_function = nn.CrossEntropyLoss()

# Optimizer as ADAM for both the Architectures.
lstm_optimizer = optim.Adam(params=lstm_model.parameters(),lr=learning_rate)
gru_optimizer = optim.Adam(params=gru_model.parameters(),lr=learning_rate)
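
`nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and integer class labels, which is why the models above end in a plain linear layer with no softmax. A minimal sketch with made-up logits and targets shows that it is equivalent to applying log-softmax followed by the negative log-likelihood loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # made-up scores: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])     # made-up integer class labels

loss = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss, manual))
```

Adding an explicit softmax before this loss would apply the normalization twice and usually hurts training.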

## Let us train LSTM first.

In [11]:
for epoch in range(epochs):
  for batch_index, (data, targets) in enumerate(trainloader): # yields batch_index, (batch_of_images, batch_of_labels)
    data = data.squeeze(1) # (batch, 1, 28, 28) -> (batch, 28, 28): 28 time steps of 28 features
    # Forward Step
    # Making predictions on train_data
    training_predictions = lstm_model(data)
    # Calculating loss
    Training_loss = loss_function(training_predictions, targets)
    # Backward Step
    lstm_optimizer.zero_grad()
    Training_loss.backward()

    # Optimizer Step
    lstm_optimizer.step()
  # Note: this is the loss of the last batch in the epoch, not the epoch average.
  print(f'At {epoch} epochs Training_loss={Training_loss}')



At 0 epochs Training_loss=0.03627399355173111
At 1 epochs Training_loss=0.001295591937378049
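
The loss printed above comes from the last batch of each epoch only, so it can jump around from epoch to epoch. A possible variation, sketched here on a toy model with random data rather than on MNIST, is to accumulate a sample-weighted average over the whole epoch. The helper name `train_one_epoch` and the toy model are assumptions for illustration; with the MNIST loaders above, the `data.squeeze(1)` step would still be needed before calling the model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_one_epoch(model, loader, loss_function, optimizer):
    """Run one pass over `loader` and return the mean loss per sample."""
    model.train()
    total_loss, total_samples = 0.0, 0
    for data, targets in loader:
        predictions = model(data)
        loss = loss_function(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # loss is the batch mean, so weight it by the batch size.
        total_loss += loss.item() * data.size(0)
        total_samples += data.size(0)
    return total_loss / total_samples

# Toy stand-ins: a linear classifier and five random batches.
torch.manual_seed(0)
toy_model = nn.Linear(8, 3)
toy_batches = [(torch.randn(16, 8), torch.randint(0, 3, (16,))) for _ in range(5)]
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)

mean_loss = train_one_epoch(toy_model, toy_batches, nn.CrossEntropyLoss(), toy_optimizer)
print(f"mean training loss: {mean_loss:.4f}")
```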


## Now let us train the GRU model.

In [12]:
for epoch in range(epochs):
  for batch_index, (data, targets) in enumerate(trainloader): # yields batch_index, (batch_of_images, batch_of_labels)
    data = data.squeeze(1) # (batch, 1, 28, 28) -> (batch, 28, 28): 28 time steps of 28 features
    # Forward Step
    # Making predictions on train_data
    training_predictions = gru_model(data)
    # Calculating loss
    Training_loss = loss_function(training_predictions, targets)
    # Backward Step
    gru_optimizer.zero_grad()
    Training_loss.backward()

    # Optimizer Step
    gru_optimizer.step()
  # Note: this is the loss of the last batch in the epoch, not the epoch average.
  print(f'At {epoch} epochs Training_loss={Training_loss}')



At 0 epochs Training_loss=0.04876665398478508
At 1 epochs Training_loss=0.05137406289577484


## Defining Metric Calculator

In [13]:
# Custom function that calculates the accuracy of the model.

def check_accuracy(loader, model):
    if loader.dataset.train:
        print("Checking accuracy on training data")
    else:
        print("Checking accuracy on test data")

    num_correct = 0
    num_samples = 0
    model.eval()

    with torch.no_grad():
        for x, y in loader:
            x = x.squeeze(1)
            scores = model(x)
            _, predictions = scores.max(1)
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

        print(f"Got {num_correct} / {num_samples} with accuracy = {float(num_correct)/float(num_samples)*100:.2f}")

## Analyzing the performances.

In [14]:
print('Training LSTM Model')
lstm_model.train() # Note: .train() only switches the model to training mode; it does not train it again.
print('Training was succesful')
print('Calculating scores:\n')
check_accuracy(trainloader, lstm_model)
check_accuracy(testloader, lstm_model)

Training LSTM Model
Training was succesful
Calculating scores:

Checking accuracy on training data
Got 59264 / 60000 with accuracy = 98.77
Checking accuracy on test data
Got 9846 / 10000 with accuracy = 98.46


In [15]:
print('Training GRU Model')
gru_model.train() # Note: .train() only switches the model to training mode; it does not train it again.
print('Training was succesful')
print('Calculating scores:\n')
check_accuracy(trainloader, gru_model)
check_accuracy(testloader, gru_model)

Training GRU Model
Training was succesful
Calculating scores:

Checking accuracy on training data
Got 59322 / 60000 with accuracy = 98.87
Checking accuracy on test data
Got 9861 / 10000 with accuracy = 98.61
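
A note on the `.train()` calls in the two cells above: `model.train()` and `model.eval()` only toggle a mode flag that layers such as dropout and batch normalization consult; neither runs any optimization. The sketch below uses a standalone `nn.Dropout` layer (the models in this notebook contain no dropout, so this layer is purely an illustration) to show the difference between the two modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()                 # training mode: random elements are zeroed, the rest rescaled
train_out = layer(x)
print(layer.training, int((train_out == 0).sum()))

layer.eval()                  # evaluation mode: dropout becomes the identity
eval_out = layer(x)
print(layer.training, torch.equal(eval_out, x))
```

Since `check_accuracy` calls `model.eval()` itself, the scores above are computed in evaluation mode regardless of the preceding `.train()` call.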


- Well, that's a good score for both models. So this was all about the implementations of LSTMs and GRUs.
# **THANK YOU!**