# Homework 3: Recurrent Neural Networks (100 points)

### Overview

We now move from image recognition to natural language processing. For this assignment, we will work with a common sentiment analysis task using the IMDB dataset. This set consists of review-label pairs, where the task is to predict whether the text of the given movie review is positive or negative, a binary classification.

### RNN Architecture

You will be comparing four different recurrent neural network architectures: a standard RNN, a Gated Recurrent Unit (GRU), a standard Long Short-Term Memory (LSTM), and a bidirectional LSTM. 

Note that a GRU/LSTM cell _is_ an RNN cell, but we will refer to an RNN in the code and questions below as the most basic RNN.

### Your Task

At the bottom of this notebook file, there are three short answer questions testing your understanding of this neural network architecture. 

Below each question is a cell with the text “Type Markdown and LaTex.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a `.html` file by choosing File - Download as - html (.html). You will be submitting this `.html` file to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import pickle

In [2]:
torch.manual_seed(0)
torch.set_num_threads(4)
torch.set_num_interop_threads(4)

In [3]:
root_dir = 'assets_week3'
reviewVocabVectors = pickle.load(open(root_dir + '/reviewVocabVectors', 'rb'))
trainIterator = pickle.load(open(root_dir + '/trainIterator', 'rb'))
testIterator = pickle.load(open(root_dir + '/testIterator', 'rb'))

In [4]:
embeddingSize = 100
hiddenSize = 8
dropoutRate = 0.5
numEpochs = 2
vocabSize = 20002
pad = 1
unk = 0

class MyRNN(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.name = model
        self.LSTM = (model == 'LSTM' or model == 'BiLSTM')
        self.bidir = (model == 'BiLSTM')
        
        self.embed = nn.Embedding(vocabSize, embeddingSize, padding_idx = pad)
        
        if model == 'RNN': 
            self.rnn = nn.RNN(embeddingSize, hiddenSize)
        elif model == 'GRU': 
            self.rnn = nn.GRU(embeddingSize, hiddenSize)
        else: 
            self.rnn = nn.LSTM(embeddingSize, hiddenSize, bidirectional=self.bidir)

        self.dense = nn.Linear(hiddenSize * (2 if self.bidir else 1), 1)
        self.dropout = nn.Dropout(dropoutRate)
        
    def forward(self, text, textLengths):
        embedded = self.dropout(self.embed(text))
        
        packedEmbedded = nn.utils.rnn.pack_padded_sequence(embedded, textLengths)
        if self.LSTM: 
            packedOutput, (hidden, cell) = self.rnn(packedEmbedded)
        else: 
            packedOutput, hidden = self.rnn(packedEmbedded)

        output, outputLengths = nn.utils.rnn.pad_packed_sequence(packedOutput)
        if self.bidir: 
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        else: 
            hidden = hidden[0]

        return self.dense(self.dropout(hidden))

In [5]:
basicRNN = MyRNN(model='RNN')
GRU = MyRNN(model='GRU') # Construct a GRU model, as above
LSTM = MyRNN(model='LSTM') # Construct a LSTM model, as above
biLSTM = MyRNN(model='BiLSTM') # Construct a BiLSTM model, as above
models = [basicRNN, GRU, LSTM, biLSTM]

In [6]:
for model in models:
    if model is None:
        continue
    model.embed.weight.data.copy_(reviewVocabVectors)
    model.embed.weight.data[unk] = torch.zeros(embeddingSize)
    model.embed.weight.data[pad] = torch.zeros(embeddingSize)

In [7]:
criterion = nn.BCEWithLogitsLoss()

def batchAccuracy(preds, targets):
    roundedPreds = (preds >= 0)
    return (roundedPreds == targets).sum().item() / len(preds)

In [8]:
# Training

for model in models: 
    if model is not None:
        model.train()

for model in models:
    if model is None:
        continue
        
    torch.manual_seed(0)
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(numEpochs):
        epochLoss = 0
        for batch in trainIterator:
            optimizer.zero_grad()
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            loss.backward()
            optimizer.step()
            epochLoss += loss.item()
        print(f'Model: {model.name}, Epoch: {epoch + 1}, Train Loss: {epochLoss / len(trainIterator)}')
    print()

Model: RNN, Epoch: 1, Train Loss: 0.7126365525033468
Model: RNN, Epoch: 2, Train Loss: 0.6936717201071931

Model: GRU, Epoch: 1, Train Loss: 0.7025310509954877
Model: GRU, Epoch: 2, Train Loss: 0.686497655366083

Model: LSTM, Epoch: 1, Train Loss: 0.6937546568453464
Model: LSTM, Epoch: 2, Train Loss: 0.6713911288839471

Model: BiLSTM, Epoch: 1, Train Loss: 0.6949846438129844
Model: BiLSTM, Epoch: 2, Train Loss: 0.6875493588959775



In [9]:
# Evaluation
for model in models: 
    if model is not None:
        model.eval()

with torch.no_grad():
    
    for model in models:
        
        if model is None:
            continue

        accuracy = 0.0
        for batch in testIterator:
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            acc = batchAccuracy(predictions, batch[1])
            accuracy += acc
        print('Model: {}, Validation Accuracy: {}%'.format(model.name, accuracy / len(testIterator) * 100))

Model: RNN, Validation Accuracy: 54.682704603580554%
Model: GRU, Validation Accuracy: 63.08903452685421%
Model: LSTM, Validation Accuracy: 74.12244245524298%
Model: BiLSTM, Validation Accuracy: 58.70284526854219%


## Homework Questions

**To make sure your code produces consistent results, it is advisable to click "Kernel -> Restart & Run All" every time you want to run your code.**

### Question 1: Coding (50 points)

First, run the code given above to assess accuracy of the default RNN model. 

Next, you will need to construct three other model types (GRU, LSTM, BiLSTM) for comparison purposes. Follow the comments in box 5 to initialize the three other model types then rerun the code with all models enabled.

Finally, compare the accuracies of all four models (the accuracy of the default RNN should not change from the initial run). Explain your results. In doing so, overview the advantages of the best performing architecture.

Validation Accuracy for RRN : ~70%

Below are the validation accuracy in the 2nd run:

Model: RNN, Validation Accuracy: 70.63459079283886%

Model: GRU, Validation Accuracy: 82.07800511508951%

Model: LSTM, Validation Accuracy: 83.81953324808184%

Model: BiLSTM, Validation Accuracy: 80.93190537084399%

Clearly LSTM is the best performing architecture for the imdb review classification.

Here are some of the advantages of using LSTM for sentiment analysis:

Ability to handle long-term dependencies: Since LSTM is specifically designed to address the vanishing gradient problem, it is able to effectively handle long-term dependencies in sequences. This is particularly useful for sentiment analysis, where the sentiment of a review may be influenced by words that are far apart in the sequence.

Memory cells to remember important information: LSTMs use memory cells to remember important information from previous time steps, which helps the network to maintain context throughout the sequence. This is particularly useful for sentiment analysis, where the sentiment of a review may depend on the overall context of the review.

Forget gates to filter out irrelevant information: LSTMs use forget gates to filter out irrelevant information from the previous time step, which helps the network to focus on the most important information for predicting the sentiment of the review. This is particularly useful for sentiment analysis, where there may be many words in a review that are not relevant to the overall sentiment.

Although BiLSTM do not have the best accuracy for our case, but BiLSTM can process the sequence in both forward and backward directions, which helps the network to capture dependencies that may be missed by unidirectional models. This is particularly useful for sentiment analysis, where the sentiment of a review may depend on the overall context of the review.

### Question 2: LSTM Gates (30 points)

LSTMs improve upon the naive RNN architecture by adding a series of gates instead of a simple matrix-vector computation. Name the gates and explain each of their functions.

The 4 gates in LSTMs and their functions are:

Input gate: This gate is responsible for deciding which new information to add to the cell state. It takes as input the previous hidden state and the current input and the output of the input gate layer is a vector of the same size as the cell state, with each element being a number between 0 and 1 that determines how much of the corresponding element in the new candidate vector to add to the cell state.

Forget gate: This gate is responsible for deciding which information to discard from the cell state. It takes as input the previous hidden state and the current input and has a sigmoid function called the forget gate layer, which decides which values to forget. The output of the forget gate layer is a vector of the same size as the cell state, with each element being a number between 0 and 1 that determines how much of the corresponding element in the previous cell state to keep.

Output gate: This gate is responsible for deciding which information from the cell state to output as the current hidden state. It takes as input the previous hidden state and the current input and has two components: a sigmoid function called the output gate layer, which decides which values to output, and a tanh function that scales the output values. The output of the output gate layer is a vector of the same size as the cell state, with each element being a number between 0 and 1 that determines how much of the corresponding element in the cell state to output.

Update gate: The update gate is an additional gate that is used in some variants of LSTMs. It is responsible for deciding which new information to add to the cell state, similar to the input gate. However, it is implemented differently from the input gate in that it takes as input the previous hidden state, the current input, and the current cell state, and uses these three inputs to calculate a new candidate state. The update gate is then used to blend the new candidate state with the previous cell state.

In summary, the input gate decides which new information to add to the cell state, the forget gate decides which information to discard from the cell state, the output gate decides which information to output as the current hidden state, and the update gate is used in some variants of LSTMs to update the cell state. Together, these gates allow LSTMs to selectively remember or forget information from previous time steps and to focus on the most relevant information for the current time step, making them well-suited for processing sequential data with long-term dependencies.

### Question 3: Applications (20 points)

LSTMs are used across many different fields, from music generation to sentiment classification to text generation. Where could they be useful in your life, whether at home, for your family, or in the workplace? Give a specific problem or application for an LSTM model that was not covered in the course slides (**though it can be related to the applications covered in the slides**). Then, with your application in mind, specifically identify the input to your model, the output from your model, and an applicable loss function. 

(As an optional extension, try implementing your LSTM on your own using the code framework given in this homework!)

I invest in stocks and would be great if I can make a model for price prediction.I could train an LSTM model to take in historical price and volume data for a given stock and predict its future prices.

The input to my model would be a time series of historical price and volume data for the given stock, with each time step representing a single day or hour. The output would be a sequence of predicted future prices for the next n time steps.

An applicable loss function for this problem could be mean squared error (MSE), which measures the average squared difference between the predicted and actual prices. This loss function is commonly used in regression problems and would help to penalize the model for large prediction errors.

By using an LSTM model to predict stock prices, I could potentially make more informed decisions about buying and selling stocks.