# HW3 - RNN


## Homework Overview
In this homework, you will learn to implement, train, and evaluate Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models on a text classification task using a dataset of IMDB movie reviews, and compare them.

**NOTE : Be sure to answer the analytical questions at the end of the notebook as well.**

In [1]:
# Imports
import nltk
nltk.download('stopwords')
import os
import random
import re
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk import wordpunct_tokenize
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader
from IPython.display import display, HTML
from datasets import load_dataset

# Enable tqdm progress bar in pandas
tqdm.pandas()

# Set device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Using device: cuda


# Dataset

In this section, we’ll load the IMDB dataset and preprocess the data to make it suitable for training RNN and LSTM models.

## Load Dataset
Description of Dataset: The IMDB movie reviews dataset consists of reviews along with their labels (positive or negative sentiment). Each review is a sentence or paragraph of text.

Download the Dataset: We will use the Hugging Face `datasets` library to download the dataset into our environment.

In [2]:
# Load the IMDb dataset (Hugging Face)
dataset = load_dataset("imdb")

# Make sure 'data' folder exists
os.makedirs('data', exist_ok=True)

# Combine train and test splits into one DataFrame
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])
full_df = pd.concat([train_df, test_df], ignore_index=True)

# Save combined DataFrame to CSV
DATA_PATH = 'data/imdb_reviews.csv'
full_df.to_csv(DATA_PATH, index=False)

print(f"✅ Saved combined IMDb reviews to: {DATA_PATH}")
print(f"Total samples: {len(full_df)}")

✅ Saved combined IMDb reviews to: data/imdb_reviews.csv
Total samples: 50000


In [3]:
# Show 5 random samples from the DataFrame
print(full_df.sample(5))

                                                    text  label
48263  I found it charming! Nobody else but Kiarostam...      1
28205  I'm usually not inclined to write reviews abou...      0
25235  i came across this film on the net by fluke an...      0
8808   Let me start out by saying I can enjoy just ab...      0
232    This is really terrible.<br /><br />The only r...      0


## Preprocessing

For our models to work effectively, we need to preprocess the text data by cleaning it and converting words to integer indices for training. The preprocessing steps are: tokenization and cleaning, replacing rare words, building the vocabulary, converting tokens to indices, and preparing the data for training.

**NOTE : Do not alter the structure of this preprocessing code, as it aligns with other parts of the notebook. However, minor adjustments for compatibility with your code are allowed if needed.**
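
The steps listed above can be traced on a toy corpus. The following is a minimal sketch, not the notebook's preprocessing code: the three pre-tokenized documents and the tiny `max_vocab` and `max_len` values are made up for illustration, and stop-word removal is skipped.

```python
from collections import Counter

# Toy corpus standing in for the tokenized reviews (hypothetical data).
docs = [["great", "movie", "great", "cast"],
        ["boring", "movie"],
        ["great", "plot", "twist"]]

max_vocab = 2   # keep only the 2 most frequent tokens
max_len = 3     # keep at most the last 3 tokens of each document

# 1. Count tokens and keep the most common ones.
counts = Counter(token for doc in docs for token in doc)
common = {token for token, _ in counts.most_common(max_vocab)}

# 2. Replace rare tokens with <UNK> and truncate to the last max_len tokens.
docs = [[t if t in common else "<UNK>" for t in doc][-max_len:] for doc in docs]

# 3. Build the vocabulary, then append <PAD> as the last index.
vocab = sorted({t for doc in docs for t in doc})
token2idx = {t: i for i, t in enumerate(vocab)}
token2idx["<PAD>"] = len(token2idx)

# 4. Map tokens to integer indices.
indexed = [[token2idx[t] for t in doc] for doc in docs]
print(token2idx)  # {'<UNK>': 0, 'great': 1, 'movie': 2, '<PAD>': 3}
print(indexed)    # [[2, 1, 0], [0, 2], [1, 0, 0]]
```

Note that the documents still have different lengths after indexing; padding to a common length happens later, per batch.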

In [4]:
def tokenize(text, stop_words):
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    tokens = wordpunct_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

In [5]:
def remove_rare_words(tokens, common_tokens, max_len):
    return [token if token in common_tokens
            else '<UNK>' for token in tokens][-max_len:]

In [6]:
def load_and_preprocess_data(data_path, max_vocab, max_len):
    df = pd.read_csv(data_path)
    stop_words = set(stopwords.words('english'))

    # Clean and tokenize
    df['tokens'] = df['text'].apply(lambda x: tokenize(x, stop_words))

    # Replace rare words with <UNK>
    all_tokens = [token for tokens in df['tokens'] for token in tokens]
    common_tokens = set(list(zip(*Counter(all_tokens).most_common(max_vocab)))[0])
    df['tokens'] = df['tokens'].apply(lambda x: remove_rare_words(x, common_tokens, max_len))

    # Remove sequences with only <UNK>
    df = df[df['tokens'].apply(lambda tokens: any(token != '<UNK>' for token in tokens))]

    # Build vocab
    vocab = sorted(set([token for tokens in df['tokens'] for token in tokens]))
    token2idx = {token: idx for idx, token in enumerate(vocab)}
    token2idx['<PAD>'] = len(token2idx)

    # Index tokens
    df['indexed_tokens'] = df['tokens'].apply(lambda tokens: [token2idx[token] for token in tokens])

    return df['indexed_tokens'].tolist(), df['label'].tolist(), token2idx

In [7]:
# How many of the most common vocab words to keep
# Uncommon words get replaced with unknown token <UNK>
max_vocab = 2500

# How many tokens long each sequence will be cut to
# Shorter sequences will get the padding token <PAD>
max_len = 100

sequences, targets, token2idx = load_and_preprocess_data(DATA_PATH, max_vocab, max_len)


In [8]:
def split_data(sequences, targets, valid_ratio=0.05, test_ratio=0.05):
    total_size = len(sequences)
    test_size = int(total_size * test_ratio)
    valid_size = int(total_size * valid_ratio)
    train_size = total_size - valid_size - test_size

    train_sequences, train_targets = sequences[:train_size], targets[:train_size]
    valid_sequences, valid_targets = sequences[train_size:train_size + valid_size], targets[train_size:train_size + valid_size]
    test_sequences, test_targets = sequences[train_size + valid_size:], targets[train_size + valid_size:]

    return train_sequences, train_targets, valid_sequences, valid_targets, test_sequences, test_targets

In [9]:
train_sequences, train_targets, valid_sequences, valid_targets, test_sequences, test_targets = split_data(sequences, targets)
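
`split_data` slices the data in its stored order, so the validation and test sets are simply the tail of the combined DataFrame. If that tail happens to hold a single label (the classification reports later in this notebook show a support of 0 for class 0, which is consistent with that), validation and test accuracy only measure performance on one class. A minimal sketch of a shuffled alternative follows; `shuffled_split`, its `seed` argument, and the toy data are hypothetical and not part of the provided code.

```python
import random

def shuffled_split(sequences, targets, valid_ratio=0.05, test_ratio=0.05, seed=0):
    """Shuffle (sequence, target) pairs together, then slice train/valid/test."""
    pairs = list(zip(sequences, targets))
    random.Random(seed).shuffle(pairs)   # seeded, so the split is reproducible
    n_test = int(len(pairs) * test_ratio)
    n_valid = int(len(pairs) * valid_ratio)
    n_train = len(pairs) - n_valid - n_test
    train = pairs[:n_train]
    valid = pairs[n_train:n_train + n_valid]
    test = pairs[n_train + n_valid:]
    return train, valid, test

# Toy data ordered by label, mimicking a label-sorted corpus.
toy_sequences = [[i] for i in range(200)]
toy_targets = [0] * 100 + [1] * 100
train, valid, test = shuffled_split(toy_sequences, toy_targets, 0.1, 0.1)
print(len(train), len(valid), len(test))  # 160 20 20
```

Each returned list already has the `(sequence, target)` pair layout that the `DataLoader` cells below build with `zip`.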

In [10]:
def collate(batch):
    inputs, targets = zip(*batch)
    inputs_padded = pad_sequences(inputs, padding_val=token2idx['<PAD>'])
    return torch.LongTensor(inputs_padded), torch.LongTensor(targets)

In [11]:
def pad_sequences(sequences, padding_val=0, pad_left=False):
    """Pad a list of sequences to the same length with a padding_val."""
    sequence_length = max(len(sequence) for sequence in sequences)
    if not pad_left:
        return [sequence + [padding_val] * (sequence_length - len(sequence)) for sequence in sequences]
    return [[padding_val] * (sequence_length - len(sequence)) + sequence for sequence in sequences]

In [12]:
batch_size = 256
train_data = list(zip(train_sequences, train_targets))
valid_data = list(zip(valid_sequences, valid_targets))
test_data = list(zip(test_sequences, test_targets))

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=collate)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False, collate_fn=collate)

# RNN section

### RNN with nn.RNN
Implement a basic RNN model using PyTorch's built-in nn.RNN.
Define layers: embedding, RNN, and fully connected.

In [13]:
class RNNClassifier(nn.Module):
    def __init__(self, output_size, hidden_size, vocab_size,
                 device=device, n_layers=1,
                 embedding_dimension=50):
        super(RNNClassifier, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.device = device
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dimension, padding_idx=token2idx['<PAD>'])

        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        """Define Needed Layers """
        
        self.rnn = nn.RNN(embedding_dimension, hidden_size, n_layers, batch_first=True)
        
        self.fc = nn.Linear(hidden_size, output_size)
        
        #################################################################################
        #                                   THE END                                     #
        #################################################################################

    def forward(self, inputs):
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        """
        Implements the forward pass: first, embed the input tokens, then pass
        the embeddings through the RNN layer to capture sequential dependencies.
        Finally, use the fully connected layer to output one score (logit) per class.
        """
        x = self.embedding(inputs)  # (batch_size, seq_length, embedding_dimension)
        h0 = torch.zeros(self.n_layers, x.size(0), self.hidden_size, device=x.device)

        out, _ = self.rnn(x, h0)  # out: (batch_size, seq_length, hidden_size)

        out = out[:, -1, :]  # hidden state at the last time step
        out = self.fc(out)

        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        return out  # raw logits per class; nn.CrossEntropyLoss applies log-softmax internally
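
Because `collate` pads on the right (`pad_left=False` is the default of `pad_sequences`), the last time step of a short review is the hidden state after the RNN has read padding tokens. One possible alternative, sketched below under assumptions, is to pick the output at the last real token of each sequence. The padding index `0`, the layer sizes, and the toy batch are hypothetical, and whether this improves accuracy on this task has not been tested here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_idx = 0                                   # hypothetical padding index
inputs = torch.tensor([[5, 3, 7, 2],
                       [4, 6, pad_idx, pad_idx]])   # right-padded batch

embedding = nn.Embedding(10, 8, padding_idx=pad_idx)
rnn = nn.RNN(8, 16, batch_first=True)

out, _ = rnn(embedding(inputs))               # (batch, seq_len, hidden)

# Number of real tokens per sequence, and the index of the last one.
lengths = (inputs != pad_idx).sum(dim=1)      # 4 and 2 for this batch
last_idx = (lengths - 1).clamp(min=0)

# Select out[b, last_idx[b], :] for every batch element b.
last_hidden = out[torch.arange(out.size(0)), last_idx]   # (batch, hidden)
print(last_hidden.shape)
```

Left padding (`pad_left=True`) is another way to make the final time step coincide with the last real token.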

### Train model

In this section, you should train the model for multiple epochs on the training data and evaluate it on the validation data after each epoch, reporting the model's accuracy. Ensure that the model is set to train mode during training and switched to eval mode for evaluation on the validation data. The objective is to implement the training loop and, in the next step, compute and report the final accuracy on the test data.

**Note**: You are not allowed to use library-built trainer functions in this section; the training loop should be implemented manually.

**Note**: To implement the training loop, you have the option to create a single train_model function that trains a model over multiple epochs, calculates training and validation accuracy, and logs the losses. Once written, this function can be reused for all RNN and LSTM models, allowing you to simply call it with different model instances for training. Reusing the function in this way will ensure that you receive credit for the training section of each subsequent model without needing to write separate loops; the correct function call is enough.








In [14]:
# TODO: edit this method
def train_model(model,
                train_loader,
                valid_loader,
                criterion,
                optimizer,
                n_epochs,
                device=device):
    """
    Trains `model` for n_epochs and evaluates it on valid_loader after each
    epoch, printing loss and accuracy. Returns the trained model.
    """
    model.to(device)
    
    for epoch in range(1, n_epochs+1):
        # ————— Training —————
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)                      # forward pass
            loss = criterion(outputs, targets)           # compute loss
            loss.backward()                              # backprop
            optimizer.step()                             # gradient step
            
            # accumulate stats
            train_loss += loss.item() * inputs.size(0)
            _, preds = torch.max(outputs, dim=1)
            train_correct += (preds == targets).sum().item()
            train_total += targets.size(0)
        
        train_loss /= train_total
        train_acc = train_correct / train_total
        
        # ————— Validation —————
        model.eval()
        val_loss = 0.0
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for inputs, targets in valid_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                
                val_loss += loss.item() * inputs.size(0)
                _, preds = torch.max(outputs, dim=1)
                val_correct += (preds == targets).sum().item()
                val_total += targets.size(0)
        
        val_loss /= val_total
        val_acc = val_correct / val_total
        
        print(f"Epoch {epoch}/{n_epochs}  "
              f"Train Loss: {train_loss:.4f}  Train Acc: {train_acc:.4f}  |  "
              f"Val Loss: {val_loss:.4f}  Val Acc: {val_acc:.4f}")
    
    return model


In [15]:
# Binary classification problem
num_classes = 2

# the model underfits when hidden_dim=128, so we use 256 instead.
hidden_dim = 256

# num_layers seems to have little or no effect on accuracy. the model does fine even if this hyperparameter is set to 1.
rnn = RNNClassifier(num_classes, hidden_dim, len(token2idx), n_layers=1)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn.parameters(), lr=3e-5)
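
`nn.CrossEntropyLoss` expects raw logits rather than probabilities, since it applies log-softmax internally. The sketch below checks this on hand-picked toy logits (the values are arbitrary) and shows how probabilities could be obtained separately if they are needed for reporting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0],
                       [0.5, 1.5]])           # raw model outputs, shape (batch, classes)
targets = torch.tensor([0, 1])

# CrossEntropyLoss on logits equals NLL loss on log-softmax of the logits.
loss = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Probabilities come from an explicit softmax; argmax is unchanged by it.
probs = F.softmax(logits, dim=1)
preds = logits.argmax(dim=1)
print(loss.item(), manual.item(), preds.tolist())
```

This is why the models in this notebook can return the output of the final linear layer directly.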

In [16]:
############################# TODO #############################
# TODO: Implement the training loop
################################################################


n_epochs  = 100

rnn = train_model(
    rnn,
    train_loader,
    valid_loader,
    criterion,
    optimizer,
    n_epochs,
    device
)


Epoch 1/100  Train Loss: 0.6879  Train Acc: 0.5544  |  Val Loss: 0.8103  Val Acc: 0.0584
Epoch 2/100  Train Loss: 0.6833  Train Acc: 0.5630  |  Val Loss: 0.8062  Val Acc: 0.0800
Epoch 3/100  Train Loss: 0.6807  Train Acc: 0.5677  |  Val Loss: 0.7975  Val Acc: 0.1044
Epoch 4/100  Train Loss: 0.6781  Train Acc: 0.5724  |  Val Loss: 0.8162  Val Acc: 0.1248
Epoch 5/100  Train Loss: 0.6619  Train Acc: 0.5997  |  Val Loss: 0.7973  Val Acc: 0.3485
Epoch 6/100  Train Loss: 0.6424  Train Acc: 0.6316  |  Val Loss: 0.6975  Val Acc: 0.5634
Epoch 7/100  Train Loss: 0.6349  Train Acc: 0.6455  |  Val Loss: 0.7084  Val Acc: 0.5978
Epoch 8/100  Train Loss: 0.6269  Train Acc: 0.6585  |  Val Loss: 0.6599  Val Acc: 0.6631
Epoch 9/100  Train Loss: 0.6228  Train Acc: 0.6649  |  Val Loss: 0.7876  Val Acc: 0.5250
Epoch 10/100  Train Loss: 0.6189  Train Acc: 0.6699  |  Val Loss: 0.6781  Val Acc: 0.6535
Epoch 11/100  Train Loss: 0.6162  Train Acc: 0.6729  |  Val Loss: 0.6999  Val Acc: 0.6214
Epoch 12/100  Train

### RNN from Scratch
Implement an RNN from scratch by creating a custom RNN cell and a model that stacks these cells over time.
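
As a reference for what a single cell computes, the sketch below writes the standard Elman recurrence, h_t = tanh(x_t W_xh + h_{t-1} W_hh + b), in NumPy and unrolls it over a short random sequence. The parameter names, sizes, and random inputs are illustrative assumptions; the PyTorch version in this section is expected to use `nn.Linear` layers instead.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, hidden_size, seq_len = 2, 3, 4, 5

# Hypothetical parameters; a PyTorch cell would hold these in nn.Linear layers.
W_xh = rng.normal(size=(input_size, hidden_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One Elman RNN step: h_t = tanh(x_t W_xh + h_prev W_hh + b_h)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

x = rng.normal(size=(batch_size, seq_len, input_size))
h = np.zeros((batch_size, hidden_size))       # initial hidden state
for t in range(seq_len):                      # unroll over time
    h = rnn_step(x[:, t, :], h)

print(h.shape)  # (2, 4)
```

The same weights are reused at every time step; only the hidden state changes.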

In [17]:
class CustomRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(CustomRNNCell, self).__init__()
        self.hidden_size = hidden_size
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        # Define input-to-hidden and hidden-to-hidden layers
        self.input_size = input_size

        self.x2h = nn.Linear(input_size, hidden_size, bias=True)
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=True)
        

        #################################################################################
        #                                   THE END                                     #
        #################################################################################

    def forward(self, input, hidden):
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                """
                Implements the forward pass: combines the input and previous hidden state
                to calculate the new hidden state for this RNN cell.
                
                # Inputs:
                #       input: of shape (batch_size, input_size)
                #       hidden: of shape (batch_size, hidden_size)
                # Output:
                #       hidden: of shape (batch_size, hidden_size)
                """
                if hidden is None:
                    hidden = input.new_zeros(input.size(0), self.hidden_size)
        
                hidden = (self.x2h(input) + self.h2h(hidden))
        
                hidden = torch.tanh(hidden)


        #################################################################################
        #                                   THE END                                     #
        #################################################################################
                return hidden


In [18]:
class CustomRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, output_size):
        super(CustomRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=token2idx['<PAD>'])
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """Define Custom RNN Cell and Fully Connected Layers"""
        self.input_size = embedding_dim
        self.num_layers = num_layers
        self.output_size = output_size

        self.rnn_cell_list = nn.ModuleList()

        self.rnn_cell_list.append(CustomRNNCell(self.input_size, self.hidden_size))
        for l in range(1, self.num_layers):
            self.rnn_cell_list.append(CustomRNNCell(self.hidden_size, self.hidden_size))

        self.fc = nn.Linear(self.hidden_size, self.output_size)

        #################################################################################
        #                                   THE END                                     #
        #################################################################################


    def forward(self, inputs):
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """
                # Implements the forward pass: performs embedding lookup, iterates through each
                # time step, and passes embeddings through the custom RNN cell. Finally,
                # applies the fully connected layers to output class probabilities.

                # # Input: token indices of shape (batch_size, sequence length)
                # #
                # # Output of shape (batch_size, output_size)
                # """

        x = self.embedding(inputs)
    
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        outs = []

        hidden = list()
        for layer in range(self.num_layers):
            hidden.append(h0[layer, :, :])

        for t in range(x.size(1)):

            for layer in range(self.num_layers):

                if layer == 0:
                    hidden_l = self.rnn_cell_list[layer](x[:, t, :], hidden[layer])
                else:
                    hidden_l = self.rnn_cell_list[layer](hidden[layer - 1],hidden[layer])
                hidden[layer] = hidden_l

            outs.append(hidden_l)

        # Take only last time step. Modify for seq to seq
        out = outs[-1]

        out = self.fc(out)
        

        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        return out  # raw logits per class; nn.CrossEntropyLoss applies log-softmax internally

### Train model

In this section, you should train the model for multiple epochs on the training data and evaluate it on the validation data after each epoch, reporting the model's accuracy. Ensure that the model is set to train mode during training and switched to eval mode for evaluation on the validation data. The objective is to implement the training loop and, in the next step, compute and report the final accuracy on the test data.

In [20]:
vocab_size = len(token2idx)
embedding_dim = 50
hidden_size = 256
num_layers = 1
output_size = 2


customRNN = CustomRNN(vocab_size, embedding_dim , hidden_size, num_layers, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(customRNN.parameters(), lr=2e-5)

In [21]:
############################# TODO #############################
# TODO: Implement the training loop
################################################################


n_epochs  = 100

customRNN = train_model(
    customRNN,
    train_loader,
    valid_loader,
    criterion,
    optimizer,
    n_epochs,
    device
)


Epoch 1/100  Train Loss: 0.6911  Train Acc: 0.5480  |  Val Loss: 0.8034  Val Acc: 0.0972
Epoch 2/100  Train Loss: 0.6862  Train Acc: 0.5570  |  Val Loss: 0.8173  Val Acc: 0.0952
Epoch 3/100  Train Loss: 0.6840  Train Acc: 0.5634  |  Val Loss: 0.7957  Val Acc: 0.1084
Epoch 4/100  Train Loss: 0.6823  Train Acc: 0.5649  |  Val Loss: 0.8116  Val Acc: 0.1040
Epoch 5/100  Train Loss: 0.6810  Train Acc: 0.5679  |  Val Loss: 0.8110  Val Acc: 0.1088
Epoch 6/100  Train Loss: 0.6799  Train Acc: 0.5699  |  Val Loss: 0.8094  Val Acc: 0.1184
Epoch 7/100  Train Loss: 0.6787  Train Acc: 0.5713  |  Val Loss: 0.8082  Val Acc: 0.1196
Epoch 8/100  Train Loss: 0.6776  Train Acc: 0.5730  |  Val Loss: 0.8030  Val Acc: 0.1325
Epoch 9/100  Train Loss: 0.6765  Train Acc: 0.5745  |  Val Loss: 0.8118  Val Acc: 0.1341
Epoch 10/100  Train Loss: 0.6743  Train Acc: 0.5769  |  Val Loss: 0.8024  Val Acc: 0.2689
Epoch 11/100  Train Loss: 0.6643  Train Acc: 0.5989  |  Val Loss: 0.7889  Val Acc: 0.3585
Epoch 12/100  Train

### Evaluate RNN models on test set
To complete evaluate_on_test, loop through the test data to get predictions, calculate accuracy, and print a classification report for model evaluation. This function can also be used to evaluate the LSTM models.

**NOTE : To earn full marks for this section, you must adjust the network's hyperparameters so that each RNN model achieves at least 70% accuracy on the test data. If you achieve less than the required accuracy, consider adjusting your training loop and hyperparameters, such as the hidden state size and learning rate, to improve model performance.**
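
When the ground-truth labels of an evaluation set contain only one class, some per-class metrics are undefined (for example, recall for a class with no true samples), and scikit-learn warns about it. The sketch below uses made-up labels to show how `accuracy_score` and `classification_report` behave in that situation; passing `zero_division=0` is one way to set undefined metrics to 0 without the warning.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy labels: the ground truth contains only class 1 (hypothetical data).
y_true = [1, 1, 1, 1]
y_pred = [1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)

report = classification_report(
    y_true, y_pred,
    labels=[0, 1],        # report both classes explicitly
    zero_division=0,      # undefined metrics become 0 instead of warning
    output_dict=True,     # return a dict rather than a formatted string
)
print(acc)
print(report["0"]["support"], report["1"]["recall"])
```

A support of 0 for one class in a report is a sign that the evaluation split itself is worth inspecting.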

In [22]:
# Evaluate on test data
def evaluate_on_test(model, test_loader):
    model.eval()
    ############################# TODO #############################
    # TODO: Iterate over the test_loader, obtain model predictions,
    # calculate accuracy, and generate a classification report.
    ################################################################
    y_true_test = []
    y_pred_test = []

    with torch.no_grad():
        for texts, labels in test_loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            _, preds = torch.max(outputs, dim=1)

            y_true_test.extend(labels.cpu().tolist())
            y_pred_test.extend(preds.cpu().tolist())
            
    test_acc = accuracy_score(y_true_test, y_pred_test)
    print(f"\nTest Accuracy: {test_acc:.4f}\n")

    
    print(classification_report(y_true_test, y_pred_test))

In [23]:
# Evaluate both RNN models on the test dataset
evaluate_on_test(rnn, test_loader)
evaluate_on_test(customRNN, test_loader)


Test Accuracy: 0.8527

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.85      0.92      2499

    accuracy                           0.85      2499
   macro avg       0.50      0.43      0.46      2499
weighted avg       1.00      0.85      0.92      2499


Test Accuracy: 0.8015

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.80      0.89      2499

    accuracy                           0.80      2499
   macro avg       0.50      0.40      0.44      2499
weighted avg       1.00      0.80      0.89      2499



  _warn_prf(average, modifier, msg_start, len(result))


# LSTM section

### LSTM with nn.LSTM
Define an LSTM model using PyTorch's built-in nn.LSTM.

In [24]:
class LSTMClassifier(nn.Module):
    def __init__(self, output_size, hidden_size, vocab_size,
                 device, bidirectional=False, n_layers=1,
                 embedding_dimension=50):
        super(LSTMClassifier, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.device = device

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dimension, padding_idx = token2idx['<PAD>'])

        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """Define the LSTM layer and fully connected layers"""
                #Your Code Here: Initialize an nn.LSTM layer and any required fully connected layers
        
        self.lstm = nn.LSTM(embedding_dimension, hidden_size, n_layers, batch_first=True, bidirectional=bidirectional)
        
        self.fc = nn.Linear(hidden_size, output_size)


        #################################################################################
        #                                   THE END                                     #
        #################################################################################

    def forward(self, inputs):
        # Initialize hidden state and cell state with zeros
        

        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """
                # Implements the forward pass: first, embed the input tokens, then pass
                # the embeddings through the LSTM layer to capture sequential dependencies.
                # Finally, use fully connected layers to output class probabilities.
                # """
                #Your Code Here

        
        x = self.embedding(inputs)

        # Zero initial hidden and cell states, created on the input's device.
        # These shapes assume bidirectional=False (one direction per layer).
        hidden = torch.zeros(self.n_layers, inputs.size(0), self.hidden_size, device=x.device)
        cell_state = torch.zeros(self.n_layers, inputs.size(0), self.hidden_size, device=x.device)

        out, _ = self.lstm(x, (hidden, cell_state))  # out: (batch_size, seq_length, hidden_size)

        out = out[:, -1, :]  # output at the last time step
        out = self.fc(out)
    
        
        #################################################################################
        #                                   THE END                                     #
        #################################################################################

        return out  # raw logits per class; nn.CrossEntropyLoss applies log-softmax internally

### Train model

In this section, you should train the model for multiple epochs on the training data and evaluate it on the validation data after each epoch, reporting the model's accuracy. Ensure that the model is set to train mode during training and switched to eval mode for evaluation on the validation data. The objective is to implement the training loop and, in the next step, compute and report the final accuracy on the test data.

**Note**: You are not allowed to use library-built trainer functions in this section; the training loop should be implemented manually.

**Note**: To implement the training loop, you have the option to create a single train_model function that trains a model over multiple epochs, calculates training and validation accuracy, and logs the losses. Once written, this function can be reused for all RNN and LSTM models, allowing you to simply call it with different model instances for training. Reusing the function in this way will ensure that you receive credit for the training section of each subsequent model without needing to write separate loops; the correct function call is enough.

In [41]:
output_size = 2
hidden_size = 256
vocab_size = len(token2idx)


lstm = LSTMClassifier(output_size, hidden_size, vocab_size, device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm.parameters(), lr=1e-3)

In [42]:
############################# TODO #############################
# TODO: Implement the training loop
################################################################

n_epochs  = 40

lstm = train_model(
    lstm,
    train_loader,
    valid_loader,
    criterion,
    optimizer,
    n_epochs,
    device
)


Epoch 1/40  Train Loss: 0.6450  Train Acc: 0.6211  |  Val Loss: 0.7778  Val Acc: 0.6319
Epoch 2/40  Train Loss: 0.4821  Train Acc: 0.7793  |  Val Loss: 0.6492  Val Acc: 0.6715
Epoch 3/40  Train Loss: 0.3841  Train Acc: 0.8327  |  Val Loss: 0.4433  Val Acc: 0.8043
Epoch 4/40  Train Loss: 0.3391  Train Acc: 0.8572  |  Val Loss: 0.4380  Val Acc: 0.8279
Epoch 5/40  Train Loss: 0.3039  Train Acc: 0.8741  |  Val Loss: 0.3775  Val Acc: 0.8235
Epoch 6/40  Train Loss: 0.2838  Train Acc: 0.8838  |  Val Loss: 0.4619  Val Acc: 0.7915
Epoch 7/40  Train Loss: 0.2602  Train Acc: 0.8957  |  Val Loss: 0.4760  Val Acc: 0.8111
Epoch 8/40  Train Loss: 0.2454  Train Acc: 0.9035  |  Val Loss: 0.3534  Val Acc: 0.8391
Epoch 9/40  Train Loss: 0.2189  Train Acc: 0.9155  |  Val Loss: 0.4281  Val Acc: 0.8175
Epoch 10/40  Train Loss: 0.2003  Train Acc: 0.9261  |  Val Loss: 0.3494  Val Acc: 0.8659
Epoch 11/40  Train Loss: 0.1748  Train Acc: 0.9361  |  Val Loss: 0.4644  Val Acc: 0.8199
Epoch 12/40  Train Loss: 0.144

### Custom LSTM from Scratch
Implement an LSTM from scratch by defining an LSTM cell and a model that combines these cells over the sequence.
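
For orientation, the sketch below writes one step of the standard LSTM update in NumPy: three sigmoid gates (input, forget, output) and a tanh candidate are computed from a single stacked projection, then combined into the new cell and hidden states. The parameter names, sizes, and random inputs are illustrative assumptions and not the notebook's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch_size, input_size, hidden_size = 2, 3, 4

# Hypothetical parameters: each projection produces all four gates at once.
W_x = rng.normal(size=(input_size, 4 * hidden_size)) * 0.1
W_h = rng.normal(size=(hidden_size, 4 * hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM step returning the new hidden state and cell state."""
    gates = x_t @ W_x + h_prev @ W_h + b
    i, f, g, o = np.split(gates, 4, axis=1)   # input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g                  # keep part of the old cell, add new content
    h_t = o * np.tanh(c_t)                    # expose a gated view of the cell
    return h_t, c_t

x_t = rng.normal(size=(batch_size, input_size))
h_t = np.zeros((batch_size, hidden_size))
c_t = np.zeros((batch_size, hidden_size))
h_t, c_t = lstm_step(x_t, h_t, c_t)
print(h_t.shape, c_t.shape)  # (2, 4) (2, 4)
```

Stacking the four gate projections into one matrix multiplication is a common layout choice; four separate linear maps would compute the same thing.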

In [27]:
class CustomLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(CustomLSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """Define Needed Layers """
        
        self.xh = nn.Linear(input_size, hidden_size * 4)
        self.hh = nn.Linear(hidden_size, hidden_size * 4)
        
        
        # Initialise all weights and biases uniformly in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)]
        std = 1.0 / np.sqrt(self.hidden_size)
        for w in self.parameters():
            nn.init.uniform_(w, -std, std)

        #################################################################################
        #                                   THE END                                     #
        #################################################################################

    def forward(self, input, hidden, cell_state):
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """Define Forward pass"""
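                    # Inputs:
                    #       input: of shape (batch_size, input_size)
                    #       hidden: previous hidden state, of shape (batch_size, hidden_size)
                    #       cell_state: previous cell state, of shape (batch_size, hidden_size)
                    # Outputs:
                    #       hidden: new hidden state, of shape (batch_size, hidden_size)
                    #       cell_state: new cell state, of shape (batch_size, hidden_size)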


        gates = self.xh(input) + self.hh(hidden)

        # Split the pre-activations into four chunks: input, forget, cell candidate, output
        i_t, f_t, c_g, o_g = gates.chunk(4, dim=1)

        input_gate = torch.sigmoid(i_t)
        forget_gate = torch.sigmoid(f_t)
        cell_gate = torch.tanh(c_g)
        output_gate = torch.sigmoid(o_g)

        
        # New cell state
        cell_state = forget_gate * cell_state + input_gate * cell_gate
        # New hidden state
        hidden = output_gate * torch.tanh(cell_state)

        #################################################################################
        #                                   THE END                                     #
        #################################################################################

        return hidden, cell_state # New hidden state , New cell state


In [31]:
class CustomLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(CustomLSTM, self).__init__()
        self.input_size = embedding_dim
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=token2idx['<PAD>'])
        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """Define Needed Layers """
        
        self.output_size = output_size
        self.rnn_cell = CustomLSTMCell(self.input_size, self.hidden_size)
        self.fc = nn.Linear(self.hidden_size, self.output_size)

        #################################################################################
        #                                   THE END                                     #
        #################################################################################

    def forward(self, inputs):
        # Initialize hidden state and cell state with zeros, directly on the input's device
        hidden = torch.zeros(inputs.size(0), self.hidden_size, device=inputs.device)
        cell_state = torch.zeros(inputs.size(0), self.hidden_size, device=inputs.device)

        ###################################### TODO #####################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
                # """
                # Implements the forward pass: first, embed the input tokens, then pass
                # the embeddings through the LSTM cell one time step at a time to capture
                # sequential dependencies. Finally, apply the fully connected layer to the
                # last hidden state to produce one raw score (logit) per class.
                # """
                #Your Code Here

        x = self.embedding(inputs)
        

        for t in range(x.size(1)):
            hidden, cell_state = self.rnn_cell(x[:, t, :], hidden, cell_state)

        out = self.fc(hidden)
            

        #################################################################################
        #                                   THE END                                     #
        #################################################################################

        return out  # raw class scores (logits); nn.CrossEntropyLoss applies log-softmax internally


### Train model

In this section, you should train the model for multiple epochs on the training data and evaluate it on the validation data after each epoch, reporting the model's accuracy. Ensure that the model is set to train mode during training and switched to eval mode for evaluation on the validation data. The objective is to implement the training loop and, in the next step, compute and report the final accuracy on the test data.

In [32]:
vocab_size = len(token2idx)
embedding_dim = 50
hidden_size = 256
output_size = 2




customLSTM = CustomLSTM(vocab_size, embedding_dim, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(customLSTM.parameters(), lr=5e-4)

In [33]:
############################# TODO #############################
# TODO: Implement the training loop
################################################################

n_epochs  = 35

customLSTM = train_model(
    customLSTM,
    train_loader,
    valid_loader,
    criterion,
    optimizer,
    n_epochs,
    device
)



Epoch 1/35  Train Loss: 0.6758  Train Acc: 0.5758  |  Val Loss: 0.7898  Val Acc: 0.1797
Epoch 2/35  Train Loss: 0.5967  Train Acc: 0.6811  |  Val Loss: 0.5131  Val Acc: 0.7307
Epoch 3/35  Train Loss: 0.4669  Train Acc: 0.7848  |  Val Loss: 0.3278  Val Acc: 0.8743
Epoch 4/35  Train Loss: 0.3955  Train Acc: 0.8252  |  Val Loss: 0.3828  Val Acc: 0.8443
Epoch 5/35  Train Loss: 0.3588  Train Acc: 0.8452  |  Val Loss: 0.5265  Val Acc: 0.7991
Epoch 6/35  Train Loss: 0.3326  Train Acc: 0.8599  |  Val Loss: 0.7293  Val Acc: 0.6487
Epoch 7/35  Train Loss: 0.3140  Train Acc: 0.8674  |  Val Loss: 0.2683  Val Acc: 0.8719
Epoch 8/35  Train Loss: 0.2941  Train Acc: 0.8784  |  Val Loss: 0.4244  Val Acc: 0.8215
Epoch 9/35  Train Loss: 0.2801  Train Acc: 0.8853  |  Val Loss: 0.4830  Val Acc: 0.7623
Epoch 10/35  Train Loss: 0.2687  Train Acc: 0.8888  |  Val Loss: 0.2607  Val Acc: 0.9012
Epoch 11/35  Train Loss: 0.2613  Train Acc: 0.8941  |  Val Loss: 0.4722  Val Acc: 0.7823
Epoch 12/35  Train Loss: 0.250

### Evaluate LSTM models on test set
To complete `evaluate_on_test`, loop through the test data to get predictions, calculate accuracy, and print a classification report for model evaluation.
You can use the `evaluate_on_test` function implemented in the previous section, or write a new function for this evaluation. Be sure to report the `classification_report` of both LSTM models.

**NOTE : To earn full marks for this section, you must adjust the network's hyperparameters so that each LSTM model achieves at least 80% accuracy on the test data. If you achieve less than the required accuracy, consider adjusting your training loop and hyperparameters, such as the hidden state size and learning rate, to improve model performance.**

In [43]:
# Evaluate both LSTM models on the test dataset
evaluate_on_test(lstm, test_loader)
evaluate_on_test(customLSTM, test_loader)


Test Accuracy: 0.8355

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.84      0.91      2499

    accuracy                           0.84      2499
   macro avg       0.50      0.42      0.46      2499
weighted avg       1.00      0.84      0.91      2499






Test Accuracy: 0.8575

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.86      0.92      2499

    accuracy                           0.86      2499
   macro avg       0.50      0.43      0.46      2499
weighted avg       1.00      0.86      0.92      2499
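
Both reports list a support of 0 for class 0, which indicates that the labels passed as ground truth contain only class 1, so the reported accuracy reflects recall on positive reviews alone. A quick look at the label balance of each split, sketched below with Python's `Counter`, can catch this before training. The label lists are toy data and `label_distribution` is a hypothetical helper; in the notebook, the labels would come from whatever lists or tensors feed the data loaders.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: fraction} for a list of integer class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Toy labels: a balanced split and a split that contains only positives.
balanced_split = [0, 1, 1, 0, 1, 0]
one_class_split = [1, 1, 1, 1]

print(label_distribution(balanced_split))
print(label_distribution(one_class_split))
```

If a split turns out to contain a single class, shuffling the data before splitting it is one common remedy.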





# Testing RNN and LSTM Models on a New Review

In [35]:
# Example review
review = "It is no wonder that the film has such a high rating, it is quite literally breathtaking. What can I say that hasn't said before? Not much, it's the story, the acting, the premise, but most of all, this movie is about how it makes you feel. Sometimes you watch a film, and can't remember it days later, this film loves with you, once you've seen it, you don't forget."


## Preprocess the test Review
To prepare the review for the model, we need to follow similar preprocessing steps as we did for the dataset:

1. Remove special characters and convert the text to lowercase.
2. Tokenize the text into individual words.
3. Remove stopwords to focus only on meaningful words.
4. Convert tokens to indices based on the `token2idx` dictionary created earlier.
5. Pad or truncate the sequence to a length of `max_len`.
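
The last two steps can be illustrated on a toy vocabulary. This is a minimal sketch, not the notebook's implementation: `encode_tokens` and the tiny `toy_token2idx` mapping are hypothetical, and it assumes that truncation keeps the last `max_len` tokens (as `remove_rare_words` does with `[-max_len:]`) and that padding is appended on the right.

```python
def encode_tokens(tokens, token2idx, max_len):
    """Map tokens to indices, then truncate or right-pad to exactly max_len."""
    unk_idx = token2idx['<UNK>']
    pad_idx = token2idx['<PAD>']
    indices = [token2idx.get(token, unk_idx) for token in tokens]
    indices = indices[-max_len:]                       # keep the last max_len tokens
    indices += [pad_idx] * (max_len - len(indices))    # pad short sequences
    return indices


# Toy vocabulary for illustration only.
toy_token2idx = {'<UNK>': 0, 'film': 1, 'great': 2, 'story': 3, '<PAD>': 4}

short = encode_tokens(['great', 'film'], toy_token2idx, max_len=4)
long = encode_tokens(['great', 'film', 'boring', 'story', 'great'], toy_token2idx, max_len=4)
print(short)  # [2, 1, 4, 4]
print(long)   # [1, 0, 3, 2]
```

Whichever truncation and padding convention is chosen, it should match the one used when the training sequences were built, since the model only ever saw inputs in that layout.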


In [36]:
def tokenize(text, stop_words):
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    tokens = wordpunct_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

def remove_rare_words(tokens, common_tokens, max_len):
    return [token if token in common_tokens
            else '<UNK>' for token in tokens][-max_len:]

# How many of the most common vocab words to keep
# Uncommon words get replaced with unknown token <UNK>
max_vocab = 2500

# How many tokens long each sequence will be cut to
# Shorter sequences will get the padding token <PAD>
max_len = 100

def load_and_preprocess_data(data_path, max_vocab, max_len):
    df = pd.read_csv(data_path)
    stop_words = set(stopwords.words('english'))

    # Clean and tokenize
    df['tokens'] = df['text'].apply(lambda x: tokenize(x, stop_words))

    # Replace rare words with <UNK> and keep only the last max_len tokens
    all_tokens = [token for tokens in df['tokens'] for token in tokens]
    common_tokens = {token for token, _ in Counter(all_tokens).most_common(max_vocab)}
    df['tokens'] = df['tokens'].apply(lambda x: remove_rare_words(x, common_tokens, max_len))

    # Remove sequences with only <UNK>
    df = df[df['tokens'].apply(lambda tokens: any(token != '<UNK>' for token in tokens))]

    # Build vocab (sorted for a deterministic index order), with <PAD> as the last index
    vocab = sorted({token for tokens in df['tokens'] for token in tokens})
    token2idx = {token: idx for idx, token in enumerate(vocab)}
    token2idx['<PAD>'] = len(token2idx)

    # Index tokens
    df['indexed_tokens'] = df['tokens'].apply(lambda tokens: [token2idx[token] for token in tokens])

    return df['indexed_tokens'].tolist(), df['label'].tolist(), token2idx


# Call the function only after it has been defined
sequences, targets, token2idx = load_and_preprocess_data(DATA_PATH, max_vocab, max_len)
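
To make the vocabulary-building steps concrete, the sketch below traces them on a toy corpus. The corpus and `build_vocab` are hypothetical; the logic mirrors the steps in `load_and_preprocess_data` (keep the `max_vocab` most frequent tokens, map the rest to `<UNK>`, then assign indices in sorted order with `<PAD>` last), while the filtering of all-`<UNK>` sequences and the truncation are omitted for brevity.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_vocab):
    """Return (docs with rare tokens replaced by <UNK>, token-to-index mapping)."""
    all_tokens = [token for doc in tokenized_docs for token in doc]
    common = {token for token, _ in Counter(all_tokens).most_common(max_vocab)}
    replaced = [[t if t in common else '<UNK>' for t in doc] for doc in tokenized_docs]
    vocab = sorted({t for doc in replaced for t in doc})
    mapping = {token: idx for idx, token in enumerate(vocab)}
    mapping['<PAD>'] = len(mapping)
    return replaced, mapping


# Toy corpus: 'good' appears 3 times, 'film' twice, 'bad' and 'plot' once each.
toy_docs = [['good', 'film'], ['bad', 'film'], ['good', 'plot', 'good']]
docs, token2idx_toy = build_vocab(toy_docs, max_vocab=2)
print(docs)
print(token2idx_toy)
```

With `max_vocab=2`, only `good` and `film` survive, so `bad` and `plot` both collapse onto the single `<UNK>` index.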

In [37]:
import torch
import re
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords


# Preprocessing function
def preprocess_text(text, stop_words, token2idx, max_len):

    ########################### TODO ###########################
    # Step 1: Clean and lowercase the input text
    ################################################################

    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()

    ########################### TODO ###########################
    # Step 2: Tokenize the text into words
    # - Use wordpunct_tokenize to split the cleaned text into individual word tokens.
    ############################################################

    tokens = wordpunct_tokenize(text)

    ########################### TODO ###########################
    # Step 3: Remove stopwords from the token list
    ############################################################

    tokens = [token for token in tokens if token not in stop_words]

    ########################### TODO ###########################
    # Step 4: Convert tokens to indices based on the token2idx dictionary
    # - For each token in the list, get the corresponding index from the token2idx dictionary.
    # - If a token is not found in token2idx, replace it with the index of '<UNK>'.
    ################################################################

    unk_idx = token2idx['<UNK>']
    tokens_idx = [token2idx.get(token, unk_idx) for token in tokens]

    ########################### TODO ###########################
    # Step 5: Pad or truncate the tokens_idx list to the desired max_len
    ############################################################

    if len(tokens_idx) >= max_len:
        # truncate to the last max_len tokens, matching remove_rare_words ([-max_len:])
        tokens_idx = tokens_idx[-max_len:]
    else:
        # pad with the <PAD> index
        pad_idx = token2idx['<PAD>']
        tokens_idx = tokens_idx + [pad_idx] * (max_len - len(tokens_idx))

    ########################### End of TODOs ###########################

    return tokens_idx  # Return the processed list of indices

# Get stopwords
stop_words = set(stopwords.words('english'))

########################### TODO ###########################
# Preprocess the review
review_indices = preprocess_text(review, stop_words, token2idx, max_len)
############################################################

# Convert the indices to a tensor and move it to the device (GPU or CPU)
input_tensor = torch.LongTensor([review_indices]).to(device)

## Make Predictions
Now that we have preprocessed the review, use the RNN and LSTM models to predict the sentiment of the review.

1. Set the model to evaluation mode so that layers such as dropout behave deterministically, and disable gradient tracking during inference.
2. Predict the sentiment class by passing `input_tensor` to the model.
3. Interpret the prediction as either "Positive" or "Negative" based on the model's output.
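
The mapping from model output to a label can be sketched in plain Python. The logits below are made-up numbers, `softmax` and `label_from_logits` are hypothetical helpers, and the mapping assumes that class 1 is positive and class 0 is negative, as in the Hugging Face IMDb labels.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Pick the highest-scoring class and name it (assumes 1 = positive)."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return "Positive" if predicted == 1 else "Negative"


toy_logits = [-0.3, 1.2]          # made-up scores for [negative, positive]
probs = softmax(toy_logits)
print(probs, label_from_logits(toy_logits))
```

Because softmax is monotonic, the argmax of the logits is the same class as the argmax of the probabilities; softmax is needed only when the probabilities themselves are of interest.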

In [38]:
def predict_sentiment(model, input_tensor, model_name="Model"):
    model.eval()  # Set the model to evaluation mode
    ############################# TODO #############################
    # TODO: Perform a forward pass with the model on the input_tensor,
    # get the predicted class label, and map it to "Positive" or "Negative".
    ################################################################

    with torch.no_grad():
        logits = model(input_tensor)                     # (1, 2) raw scores
        probs  = torch.softmax(logits, dim=1)            # (1, 2) probabilities
        pred_i = torch.argmax(probs, dim=1).item()       # scalar 0 or 1

    class_label = "Positive" if pred_i == 1 else "Negative"

    print(f"The predicted class for the review by {model_name} is: {class_label}")

In [44]:
# Make predictions with the "predict_sentiment" function for each of the four models above

predict_sentiment(rnn, input_tensor, "RNN")
predict_sentiment(customRNN, input_tensor, "CustomRNN")
predict_sentiment(lstm, input_tensor, "LSTM")
predict_sentiment(customLSTM, input_tensor, "CustomLSTM")



The predicted class for the review by RNN is: Positive
The predicted class for the review by CustomRNN is: Positive
The predicted class for the review by LSTM is: Positive
The predicted class for the review by CustomLSTM is: Positive


# Questions

[1] - Based on your observations, what do you think caused the difference in performance between the RNN and LSTM models (on test set)? Analyze this difference using the results from the notebook, and discuss where a simple RNN might perform better.
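A main weakness of the simple RNN is the vanishing gradient, which limits its ability to retain information from early in the sequence. The LSTM handles long-term dependencies better: its gates and additive cell-state update let information and gradients pass through many time steps with much less decay, which mitigates the problem although it does not remove it entirely.
A simple RNN may perform as well or better when the text is short and the sentiment depends on a few nearby words, since it has fewer parameters and is cheaper to train.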
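
The argument can be made slightly more concrete with the standard recurrences (a sketch in generic notation, not tied to the notebook's code). For a simple RNN with $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$, the gradient reaching an early step $k$ is a product of per-step Jacobians, with factors ordered from $t = T$ down to $t = k+1$:

$$
\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=k+1}^{T} \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh}
$$

Each factor combines a tanh derivative of at most 1 with the same matrix $W_{hh}$, so the product tends to shrink or grow geometrically with $T - k$. In an LSTM the cell state is updated additively, $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, and along the direct cell-state path (treating the gates as constants, that is, ignoring their dependence on $h_{t-1}$):

$$
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
$$

When the forget gate stays close to 1, the gradient passes through this path nearly unchanged.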





[2] - If we increase max_len in the preprocessing step to 300, what changes in the models' (RNN and LSTM) performance would you expect, and why? Please *explain* and discuss the impact this may have on the learning process and the final results.
RNN: Each gradient term now contains up to 300 Jacobian products, so vanishing or exploding gradients become more severe. In principle the RNN could learn dependencies from farther back, but in practice it tends to memorise local patterns and treat the rest as noise, which can cause overfitting.
LSTM: The LSTM can use information from much farther back, which should improve its grasp of the sentiment, at the cost of more computation per review.

I expect the LSTM's accuracy to stay the same or improve, and the RNN's test accuracy to decrease.
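
A toy calculation illustrates the scale of the effect. It assumes, purely for illustration, that every backpropagation step scales the gradient by the same constant factor; real Jacobians vary from step to step, so the numbers are indicative only. `gradient_scale` is a hypothetical helper.

```python
def gradient_scale(per_step_factor, n_steps):
    """Overall gradient scaling after n_steps identical multiplicative steps."""
    return per_step_factor ** n_steps


# Compare a slightly contracting and a slightly expanding step at both lengths.
for factor in (0.95, 1.05):
    for max_len in (100, 300):
        scale = gradient_scale(factor, max_len)
        print(f"factor={factor}  max_len={max_len}  scale={scale:.3e}")
```

Even a per-step factor close to 1 produces a gradient that is orders of magnitude smaller (or larger) at 300 steps than at 100, which is why the longer sequences are expected to hurt the simple RNN more than the LSTM.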