# Assignment 4: Recurrent Neural Networks (41 marks total)
### Due: November 19 at 11:59pm (grace period until November 21 at 11:59pm)

### Name:

The goal of this assignment is to apply Recurrent Neural Networks (RNNs) in PyTorch for text data classification.

## Part 1: LSTM

### Step 0: Import Libraries

In [132]:
import torch
from datasets import load_dataset
from collections import Counter
import re
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

In [133]:
import warnings
warnings.filterwarnings(action='ignore')

### Step 1: Data Loading and Preprocessing (12 marks)

For this assignment, we will be using the IMDB dataset from the 🤗 Datasets library.

In [134]:
# TO DO: Load the dataset (1 mark)
dataset = load_dataset("imdb")
train_data = dataset['train']
test_data = dataset['test']

We need to preprocess the data before we can feed it into the model. The first step is to define a custom tokenizer to perform the following tasks: 
- Extract the text data from the dataset
- Remove any non-alphanumeric characters
- Separate each data sample into separate words (tokens)

In [135]:
def tokenizer(data_iter):
    '''Tokenizes the input data
    input: data_iter (type: dictionary)
    output: text (type: list[list[str]])
    '''

    text = []

    for review in data_iter:
        review['text'] = re.sub(r"[^a-zA-Z0-9\s]", '', review['text'])
        review['text'] = review['text'].lower()
        text.append(review['text'].split())

    return text
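As a quick check of what the regular expression does, the sketch below applies the same cleaning steps to a made-up review string. The sample text and the helpers `clean_and_split` and `clean_and_split_no_html` are assumptions for illustration and are not part of the assignment. IMDB reviews contain HTML line breaks such as `<br />`; because only the punctuation is stripped, the letters `br` survive and can fuse with a neighbouring word.

```python
import re

def clean_and_split(raw):
    # Same steps as tokenizer(): strip non-alphanumerics, lowercase, split on whitespace
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", '', raw)
    return cleaned.lower().split()

def clean_and_split_no_html(raw):
    # Hypothetical variant: replace <br /> tags with a space before cleaning
    without_tags = re.sub(r"<br\s*/?>", ' ', raw)
    return clean_and_split(without_tags)

sample = "Great movie!<br /><br />Loved it."  # made-up review text
tokens = clean_and_split(sample)
tokens_no_html = clean_and_split_no_html(sample)
print(tokens)
print(tokens_no_html)
```

Whether the extra `br` tokens matter in practice is not something this notebook measures; the variant is only one possible way to handle them.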

We will also need to extract the labels from the dataset. Complete the label_extractor function below:

In [136]:
def label_extractor(data_iter):
    '''Takes the label for each data sample and stores it in a separate list
    input: data_iter (type: dictionary)
    output: labels (type: list)
    '''
    # TO DO: fill in this function (1 mark)
    labels = [item['label'] for item in data_iter]
    return labels

Now that we have the text data separated into words, we need to define the vocabulary. We cannot keep all the words in the vocabulary, so we want to limit the vocabulary size and only take the most common words. In this case, the maximum vocabulary size is 10,000 words. Any word that is excluded will be set to an unknown token. You can use the function below to build the vocabulary:

In [137]:
# Build a vocabulary
def build_vocab(data_iter, max_size=10000):
    '''Creates a vocabulary based on the training data
    input: data_iter (type: list[list[str]])
    output: vocab (type: dictionary)
    '''
    counter = Counter()
    for words in data_iter:
        counter.update(words)
    # Filter to most common words
    vocab = {word: i + 1 for i, (word, _) in enumerate(counter.most_common(max_size))}
    # Add a token for unknown words (0)
    vocab['<unk>'] = 0 
    return vocab

In the vocabulary, each word is mapped to an integer index. We will need to encode the dataset using these indices, as tensors cannot hold string data.
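To make the mapping concrete, here is a minimal sketch that repeats the logic of `build_vocab` on a tiny made-up corpus with a vocabulary limit of 2. The names `toy_texts` and `toy_vocab` are hypothetical.

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration)
toy_texts = [["good", "movie"], ["good", "film", "good"], ["movie"]]

counter = Counter()
for words in toy_texts:
    counter.update(words)

# Keep only the 2 most common words; indices start at 1 because 0 is used for <unk>
toy_vocab = {word: i + 1 for i, (word, _) in enumerate(counter.most_common(2))}
toy_vocab['<unk>'] = 0

# "film" was cut from the vocabulary, so it falls back to the <unk> index
encoded = [toy_vocab.get(word, toy_vocab['<unk>']) for word in ["good", "film", "movie"]]
print(toy_vocab)
print(encoded)
```

Note that the resulting dictionary has one more entry than the limit, which is why the embedding layer later in the notebook reports 10001 rows for a 10,000-word vocabulary.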

The next step is to pad or truncate each sequence based on a maximum length, to make sure that the dataset can be transformed into a tensor (as discussed in class).

Fill in the function below to encode and pad the dataset:

In [138]:
def encode_and_pad(text, vocab, max_len=100):
    '''Encode and pad the input text dataset
    input: text (type: list[list[str]])
    input: vocab (type: dictionary)
    input: max_len (type: int)
    output: texts (type: list[list[int]])
    '''
    # TO DO: fill in the function to encode text to integers and pad/truncate sequences (2 marks)
    texts = []
    for sentence in text:
        encoded = [vocab.get(word, vocab['<unk>']) for word in sentence]
        if len(encoded) < max_len:
            encoded += [0] * (max_len - len(encoded))
        else:
            encoded = encoded[:max_len]
        texts.append(encoded)
    return texts
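The padding and truncation branches can be checked in isolation on short integer lists. The helper `pad_or_truncate` below is a hypothetical stand-alone version of the logic in `encode_and_pad`, with the pad value exposed as a parameter.

```python
def pad_or_truncate(encoded, max_len, pad_idx=0):
    # Right-pad short sequences with pad_idx; cut long ones down to max_len
    if len(encoded) < max_len:
        return encoded + [pad_idx] * (max_len - len(encoded))
    return encoded[:max_len]

# The 0 in the middle stands for an <unk> word; the trailing zeros are padding
short_seq = pad_or_truncate([5, 0, 9], max_len=5)
long_seq = pad_or_truncate([5, 3, 9, 2, 7, 4], max_len=5)
print(short_seq)
print(long_seq)
```

Because the pad value and the `<unk>` index are both 0 in this notebook, the model cannot tell padding from an unknown word. One option, not used in the assignment, would be to reserve a separate index for padding and pass it to `nn.Embedding` through its `padding_idx` argument.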


The next step is to create a custom PyTorch Dataset class that calls the `encode_and_pad()` function and stores the text and labels as tensors. Fill in the `__init__` method of the class:

In [139]:
# Create a custom PyTorch Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len):
        # TO DO: call the encode_and_pad() function and set self.texts and self.labels (2 marks)
        self.texts = torch.tensor(encode_and_pad(texts, vocab, max_len))
        self.labels = torch.tensor(labels)
    def __len__(self): 
        return len(self.labels)
    def __getitem__(self, idx): 
        return self.texts[idx], self.labels[idx]

Now you can call all the functions that have been created:

In [140]:
MAX_LEN = 256 # Sequence length
BATCH_SIZE = 64

# TO DO: Tokenize training data (1 mark)
train_texts = tokenizer(train_data)

# TO DO: Extract labels from training and testing data (1 mark)
train_labels = label_extractor(train_data)
test_labels = label_extractor(test_data)

# TO DO: Build Vocabulary (from training data only) (1 mark)
vocab = build_vocab(train_texts)

# TO DO: Prepare datasets (using TextDataset class) and store datasets using DataLoaders (1 mark)
train_dataset = TextDataset(train_texts, train_labels, vocab, MAX_LEN)
test_dataset = TextDataset(tokenizer(test_data), test_labels, vocab, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
text_batch, label_batch = next(iter(train_loader))
print(text_batch.shape)
print(label_batch.shape)


torch.Size([64, 256])
torch.Size([64])


### Step 2: Define Model (4 marks)

For this assignment, we will be using the LSTM model. Inside the LSTM model, the first layer will be an embedding layer, which converts the integer index of each word into a dense embedding vector. We can use `nn.Embedding(...)` for this.

Define LSTMClassifier below:

In [141]:
# TO DO: Define LSTM class (4 marks)
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers=1):
        super().__init__()
        # TO DO: Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # TO DO: LSTM layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        # TO DO: Linear fully-connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # TO DO: Fill in the model steps
        # NOTE: The LSTM outputs (output, (hidden, cell)) - output and cell are not used
        # NOTE: Use the hidden state from the final time step for the fc layer
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        out = self.fc(hidden[-1])
        return out
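The tensor shapes at each stage of `forward` can be traced with a small sketch. The sizes below are made up so the shapes are easy to read, and the layers are built directly rather than through `LSTMClassifier`; two LSTM layers are used here (an assumption, the notebook uses one) to show what `hidden[-1]` selects.

```python
import torch
import torch.nn as nn

# Small, made-up sizes so the shapes are easy to read
batch_size, seq_len = 4, 7
vocab_size, embedding_dim, hidden_dim, num_layers = 50, 8, 16, 2

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # integer token ids
embedded = embedding(x)                     # (batch, seq_len, embedding_dim)
lstm_out, (hidden, cell) = lstm(embedded)   # lstm_out: (batch, seq_len, hidden_dim)
last_hidden = hidden[-1]                    # top layer, final time step: (batch, hidden_dim)
logits = fc(last_hidden)                    # (batch, 1)

print(embedded.shape, lstm_out.shape, hidden.shape, logits.shape)
```

For a unidirectional LSTM such as this one, `hidden[-1]` holds the same values as `lstm_out[:, -1, :]`, so either could feed the linear layer.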


### Step 3: Define Training and Testing Loops (4 marks)

The next step is to define functions for the training and testing loops. For this case, we will only be calculating the loss at each epoch.

In [142]:
# TO DO: Define training loop (2 marks)
def train_model(model, train_loader, loss_fn, optimizer, num_epochs=5, device='cpu'):
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for texts, labels in train_loader:
            texts, labels = texts.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(texts)
            labels = labels.float().unsqueeze(1)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')

# train_model(model, train_loader, criterion, optimizer, num_epochs=5)

In [143]:
# TO DO: Define testing loop (2 marks)

def test_model(model, test_loader, loss_fn, device='cpu'):
    model.eval()
    total = 0
    correct = 0
    total_loss = 0.0

    with torch.no_grad():
        for texts, labels in test_loader:
            texts, labels = texts.to(device), labels.to(device)

            labels = labels.float().unsqueeze(1)

            outputs = model(texts)
            loss = loss_fn(outputs, labels)
            total_loss += loss.item()

            predicted = (outputs >= 0).long()

            total += labels.size(0)
            correct += (predicted == labels.long()).sum().item()

    avg_loss = total_loss / len(test_loader)

    print(f"Test Loss: {avg_loss:.4f}")
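The testing loop above accumulates `correct` and `total` but reports only the loss. A minimal sketch of a variant that also returns accuracy is shown below; the function name `evaluate`, the stand-in model `AlwaysPositive`, and the one-batch `toy_loader` are assumptions made so the example runs on its own.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, loss_fn, device='cpu'):
    # Returns (average batch loss, accuracy) instead of only printing the loss
    model.eval()
    total, correct, total_loss = 0, 0, 0.0
    with torch.no_grad():
        for texts, labels in loader:
            texts = texts.to(device)
            labels = labels.to(device).float().unsqueeze(1)
            outputs = model(texts)
            total_loss += loss_fn(outputs, labels).item()
            predicted = (outputs >= 0).float()  # logit >= 0 is the same as sigmoid >= 0.5
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return total_loss / len(loader), correct / total

class AlwaysPositive(nn.Module):
    # Hypothetical stand-in model that predicts the positive class for every input
    def forward(self, x):
        return torch.full((x.size(0), 1), 2.0)

# One made-up batch of 4 "reviews" with labels 1, 1, 0, 1
toy_loader = [(torch.zeros(4, 5, dtype=torch.long), torch.tensor([1, 1, 0, 1]))]
avg_loss, accuracy = evaluate(AlwaysPositive(), toy_loader, nn.BCEWithLogitsLoss())
print(f"Test Loss: {avg_loss:.4f}, Test Accuracy: {accuracy:.2%}")
```

With the real model and `test_loader`, the same call would give an accuracy figure to read alongside the test loss.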



### Step 4: Train and Evaluate (3 marks)

Now that we have all the necessary functions, we can select our hyperparameters, and train and evaluate our model. For this case, since we are not comparing different models, we do not need a validation set.

In [144]:
# Hyperparameters
VOCAB_SIZE = len(vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 1 # Binary classification
NUM_LAYERS = 1

In [145]:
# TO DO: Create model object (1 mark)
model = LSTMClassifier(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, NUM_LAYERS)

In [146]:
import torch.optim as optim

# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

LSTMClassifier(
  (embedding): Embedding(10001, 100)
  (lstm): LSTM(100, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)
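The layer sizes printed above are enough to estimate the number of trainable parameters by hand. The arithmetic below is a sketch that assumes PyTorch's usual parameterization of a single-layer, unidirectional LSTM (input weights, recurrent weights, and two bias vectors for each of the four gates).

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 10001, 100, 128, 1

# One embedding_dim-sized row per vocabulary entry
embedding_params = vocab_size * embedding_dim
# 4 gates, each with input weights, recurrent weights, and two bias vectors
lstm_params = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
# Linear layer: weights plus bias
fc_params = hidden_dim * output_dim + output_dim

total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
```

The estimate can be compared against `sum(p.numel() for p in model.parameters())` on the actual model. Under these assumptions, most of the parameters sit in the embedding layer.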

Since this is a binary classification problem, we will use the binary cross entropy criterion, `BCEWithLogitsLoss()`. This loss is similar to cross entropy, but applies a sigmoid to the model output instead of a softmax. For the optimizer, we will use Adam with a learning rate of 0.01.
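The relationship between `BCEWithLogitsLoss` and the sigmoid can be checked numerically. The sketch below uses made-up logits and targets and compares the result with applying `torch.sigmoid` followed by `nn.BCELoss`.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-1.5], [0.0], [2.0]])  # made-up raw model outputs
targets = torch.tensor([[0.0], [1.0], [1.0]])

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_after_sigmoid = nn.BCELoss()(torch.sigmoid(logits), targets)

# Thresholding the logit at 0 matches thresholding the probability at 0.5
pred_from_logits = (logits >= 0).long()
pred_from_probs = (torch.sigmoid(logits) >= 0.5).long()

print(loss_with_logits.item(), loss_after_sigmoid.item())
```

The two losses agree up to floating-point error, which is why the model's final layer outputs raw logits and the testing loop can threshold them at 0.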

In [147]:
# TO DO: Define optimization model and criterion (1 mark)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

We can now run our training and testing loops. Since this takes a long time to run, we will set the number of epochs to 5. Print out the training and testing losses.

In [148]:
# TO DO: Run training and testing loops and print losses for each epoch (1 mark)
train_model(model, train_loader, criterion, optimizer, num_epochs=5, device=device)

test_model(model, test_loader, criterion, device=device)

Epoch 1/5, Loss: 0.6952
Epoch 2/5, Loss: 0.5704
Epoch 3/5, Loss: 0.4342
Epoch 4/5, Loss: 0.3799
Epoch 5/5, Loss: 0.3480
Test Loss: 0.4824


## Part 2: Questions and Process Description

### Questions (12 marks)

1. Do you think this model worked well to classify the data? Why or why not? Can you make a good decision about this only using loss data?
1. What could you do to further improve the results? Provide two suggestions.
1. Why does a simple RNN often underperform compared to LSTM or GRU on long text sequences such as IMDB reviews?
1. Why does the embedding layer improve performance compared to one-hot encoding?
1. If we switched to character-level input instead of word-level, what changes would we expect in performance and training time?
1. How does vocabulary size influence model performance and generalization?

*ANSWER HERE*

1. I think the model worked reasonably well: the training loss went down steadily over the epochs without diverging. However, we cannot make a good decision based on the loss alone, because the loss does not directly state how accurate the model's predictions are.
2. To improve the results, we could train the model for more epochs, which would allow the loss to continue to decrease. We could also increase model complexity by adding more LSTM layers or increasing the hidden size.
3. A simple RNN suffers from vanishing and exploding gradients on long sequences. Using an LSTM or GRU mitigates this and allows the model to capture longer-range patterns.
4. One-hot encoding each word would create very large vectors that do not contain much information. Embeddings instead produce low-dimensional vectors in which similar words have similar representations.
5. Switching to character-level input would dramatically increase the training time, since the sequences would be much longer. Performance would likely decrease, since the current model parameters would be unlikely to capture meaningful patterns.
6. A small vocabulary would decrease training time but would likely underfit. A large vocabulary would increase training time and tend to overfit.
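The vanishing and exploding gradient problem mentioned in answer 3 can be illustrated with a deliberately simplified scalar model: a gradient that is multiplied by the same factor at every time step. The helper `scaled_gradient` and the factors 0.9 and 1.1 are assumptions chosen for illustration; a real RNN multiplies by matrices that change from step to step.

```python
def scaled_gradient(factor, num_steps):
    # Repeatedly multiply a unit gradient by the same per-step factor
    grad = 1.0
    for _ in range(num_steps):
        grad *= factor
    return grad

seq_len = 256  # matches MAX_LEN in this notebook
shrinking = scaled_gradient(0.9, seq_len)
growing = scaled_gradient(1.1, seq_len)
print(f"{shrinking:.3e}  {growing:.3e}")
```

Even factors this close to 1 leave almost no signal, or an enormous one, after 256 steps.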

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

1. I used the examples provided on D2L. I used ChatGPT to provide examples of using the imported libraries and to help explain what specific steps in the D2L examples were doing.
2. I completed the steps in order from beginning to end; however, I needed to go back and fix the data processing step for my neural network because there were errors that caused the model to be trained incorrectly.
3. When I used generative AI, I asked it to explain what the examples provided on D2L were doing.
4. I had some difficulty understanding how the data should look after being transformed into a vector.

## Part 3: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked that we got to go through the whole process of training a linear model along with a neural network that allowed us to see the difference between them and see how much better the neural network performed. I would have liked more experience with the data processing side of things so that I could have understood better what it means to clean data and represent it in the best way for the specific model. Maybe this will be covered in later assignments but it felt a bit rushed in this one.