# Sentiment Analysis of IMBD Movie Ratings with RNN 

The notebook includes text preprocessing along with the embedding part and the training using customly written LSTM neural network

In [38]:
import re
import pandas as pd
from collections import Counter
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import torch.nn as nn
from tqdm import tqdm

In [None]:
root = "../IMDB-Dataset.csv" # path to a csv file downloaded form https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

## Text Preprocessing and Word Embedding

Let's have a look at our dataset that is currently in a CSV format

In [None]:
# Loading csv data
data_path = root
df = pd.read_csv(data_path)
df['sentiment'] = df['sentiment'].apply(lambda x:1 if x == 'positive' else 0)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [None]:
# splitting dataset
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

To use in RNN tasks we need to tokenize our dataset and bring it into compatible format.

In [None]:
def tokenize(text):
    return [s.lower() for s in re.split(r'\W+', text) if s]


# Maximum tokens allowed per review.
max_tokens = 80

# Defining set of stopwords to remove from the vocabulary and token lists.
stop_words = {"a", "an", "and", "the"}


# Building the vocabulary using training data.
freqs = Counter()
for text in train_df['review']:
    tokens = [token for token in tokenize(text) if token not in stop_words][:max_tokens]
    freqs.update(tokens)

# Initializing the vocabulary with special tokens.
vocab = {'<eos>': 0, '<unk>': 1}
# Adding the 50 most common tokens from the training data.
for token, _ in freqs.most_common(50):
    vocab[token] = len(vocab)


# Mapping Tokens to unique indices
# if token does not exist in vocabulary we assign it to <unk> token
def tokens_to_indices(tokens, vocab):
    return [vocab.get(token, vocab['<unk>']) for token in tokens]

# Preparing data: Creating tuples (raw_text, tokens, token_indices, sentiment).
def prepare_data(df, vocab, max_tokens=40):
    data_list = []
    for _, row in df.iterrows():
        raw_text = row['review']
        sentiment = row['sentiment']
        tokens = tokenize(raw_text)[:max_tokens]  # truncating tokens
        indices = tokens_to_indices(tokens, vocab)
        data_list.append((raw_text, tokens, indices, sentiment))
    return data_list

In [None]:
freqs # Most frequent words in dictionary

Counter({'of': 86474,
         'to': 74070,
         'i': 69669,
         'is': 64651,
         'it': 61590,
         'this': 54924,
         'in': 53014,
         'br': 48456,
         'that': 40760,
         'was': 35086,
         'movie': 33781,
         's': 33139,
         'film': 25914,
         'as': 24935,
         'with': 24597,
         'for': 23975,
         'but': 23867,
         'on': 20404,
         't': 19048,
         'you': 18101,
         'not': 17667,
         'have': 17258,
         'one': 17105,
         'are': 16079,
         'be': 14942,
         'his': 14144,
         'at': 13339,
         'all': 13186,
         'he': 13020,
         'by': 12878,
         'so': 12551,
         'who': 12301,
         'from': 12063,
         'like': 11762,
         'they': 11576,
         'about': 11419,
         'has': 10365,
         'there': 10328,
         'just': 10079,
         'what': 9677,
         'my': 9453,
         'good': 9350,
         'or': 9174,
         'very': 91

In [88]:
train_data = prepare_data(train_df, vocab, max_tokens)
val_data = prepare_data(val_df, vocab, max_tokens)

In [None]:
train_data[0]

Now Let's Define our Dataset and Dataloader classes

In [89]:
class SentimentDataset(Dataset):
    def __init__(self, data):
        """
        Inputs:
            data: list of tuples (raw_text, tokens, token_indices, sentiment)
        """
        self.data = data
        # Sorting by token list length (largest first)
        self.data.sort(key=lambda x: len(x[1]), reverse=True)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        """
        Inputs:
            i: an integer index
        Outputs:
            data: A dictionary of {data, label}
        """
        _, _, indices, sentiment = self.data[i]
        return {
            'data': torch.tensor(indices).long(),
            'label': torch.tensor(sentiment).float()
        }


In [None]:
# Creating dataset objects.
train_dataset = SentimentDataset(train_data)
val_dataset = SentimentDataset(val_data)

In [111]:
print(f'Length of the datastet: {len(train_dataset)}')
print('Sample:')
train_dataset[0]

Length of the datastet: 40000
Sample:


{'data': tensor([10, 13, 41,  4,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 10,
          1,  1,  1,  1,  1,  1,  1,  1,  1, 49, 21,  1,  2,  1, 24,  1,  1, 33,
         23, 32,  1,  1, 10,  6,  5,  1,  1,  3,  1, 41,  1,  3,  1, 36, 25, 40,
          1,  1,  1, 17,  1,  1,  3,  1, 27,  1,  1, 19,  1,  1, 10, 38,  1,  1,
          1,  1,  8,  1,  1,  1, 19,  1]),
 'label': tensor(0.)}

In [None]:
# Defining a collate function for DataLoader to correctly handle batches.
def collate(batch):
    data = pad_sequence([item['data'] for item in batch])
    lengths = torch.tensor([len(item['data']) for item in batch])
    labels = torch.stack([item['label'] for item in batch])
    return {
        'data': data,
        'lengths': lengths,
        'label': labels
    }

In [113]:
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate)

In [114]:
next(iter(train_loader))

{'data': tensor([[ 1,  1, 41,  ...,  1,  1,  1],
         [ 1,  1,  1,  ...,  1,  1,  1],
         [ 1,  1,  1,  ...,  1,  1, 38],
         ...,
         [49,  1,  4,  ..., 36, 36,  1],
         [ 1, 38,  1,  ...,  1,  1,  1],
         [ 1, 47,  1,  ...,  3,  1, 14]]),
 'lengths': tensor([80, 80, 80, 80, 60, 80, 80, 80, 80, 80, 55, 80, 57, 80, 80, 80]),
 'label': tensor([1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0.])}

## Training and Evaluation

For training we will use our <code>RNNClassifier</code> that uses the output of last hidden layer of lstm to predict the label.
Have a look at the <code>RNNClassfier</code> class in <code>classfier.py</code> to get better understanding

Since RNNClassifier uses LSTM and Embedding, check  <code>Embedding</code> and <code>LSTM</code> classes in <code>layers.py</code> file to completely understand the implementation part 

In [118]:
from classifier import RNNClassifier

In [None]:

@torch.no_grad()
def compute_accuracy(model, data_loader):
    """Computes the accuracy of the model"""
    corrects = 0
    total = 0
    device = next(model.parameters())
    
    for i, x in enumerate(data_loader):
        input = x['data']
        lengths = x['lengths']
        label = x['label']
        pred = model(input, lengths)
        corrects += ((pred > 0.5) == label).sum().item()
        total += label.numel()
        
        if i > 0  and i % 100 == 0:
            print('Step {} / {}'.format(i, len(data_loader)))
    
    return corrects / total

In [None]:
model = RNNClassifier(num_embeddings=len(vocab), embedding_dim=20, hidden_size=32)
criterion = nn.BCELoss()  # Binary Cross Entropy Loss
#optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Finally, let's define the training function in accordance with pytorch's training pipeline 

In [123]:

def initialize_training(
    model,
    train_loader,
    val_loader,
    epochs=10,
    lr=0.001,
    criterion=None,
    optimizer_class=torch.optim.Adam
):
    """
    Initializes the training process and runs the training/validation loops.

    Args:
        model: PyTorch model to be trained.
        train_loader: DataLoader for training data (returns dict with keys ['data', 'label', 'lengths']).
        val_loader: DataLoader for validation data (same structure as train_loader).
        epochs: Number of epochs to train for.
        lr: Learning rate for the optimizer.
        criterion: Loss function to use (default: BCELoss).
        optimizer_class: Optimizer class to use (default: Adam).
    """
    if criterion is None:
        criterion = nn.BCELoss()  # Default to binary cross-entropy loss

    optimizer = optimizer_class(model.parameters(), lr=lr)

    # Training Loop
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        train_loop = tqdm(train_loader, desc=f'Training Epoch [{epoch + 1}/{epochs}]')
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=epochs * len(train_loader) / 5, gamma=0.7)

        
        for batch in train_loop:
            optimizer.zero_grad()
            
            # Extract data, labels, and lengths from batch
            sequences = batch['data']  # Input sequences
            labels = batch['label']    # Ground truth labels
            lengths = batch['lengths'] # Actual lengths of sequences
            
            # Forward pass
            outputs = model(sequences, lengths)  # Transpose sequences to (seq_len, batch_size)
            
            # Calculate loss and perform backpropagation
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
            train_loop.set_postfix(loss=loss.item())
            
        scheduler.step()
        train_loss /= len(train_loader)

        # Validation Phase
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        val_loop = tqdm(val_loader, desc=f'Validation Epoch [{epoch + 1}/{epochs}]')
        with torch.no_grad():
            for batch in val_loop:
                sequences = batch['data']  # Input sequences
                labels = batch['label']    # Ground truth labels
                lengths = batch['lengths'] # Actual lengths of sequences

                outputs = model(sequences, lengths)
                loss = criterion(outputs, labels)
                val_loss += loss.item()

                # Calculate accuracy
                predictions = (outputs > 0.5).float()  # Threshold at 0.5
                correct += (predictions == labels).sum().item()
                total += labels.size(0)
                val_loop.set_postfix(loss=loss.item())

        val_loss /= len(val_loader)
        accuracy = correct / total * 100

        print(f"Epoch {epoch + 1}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Accuracy: {accuracy:.2f}%")



initialize_training(model=model, train_loader=train_loader, val_loader=val_loader)

Training Epoch [1/10]: 100%|██████████| 2500/2500 [01:19<00:00, 31.51it/s, loss=0.658]
Validation Epoch [1/10]: 100%|██████████| 625/625 [00:05<00:00, 110.67it/s, loss=0.669]


Epoch 1/10, Train Loss: 0.6848, Val Loss: 0.6680, Accuracy: 59.76%


Training Epoch [2/10]: 100%|██████████| 2500/2500 [01:21<00:00, 30.66it/s, loss=0.565]
Validation Epoch [2/10]: 100%|██████████| 625/625 [00:05<00:00, 112.43it/s, loss=0.672]


Epoch 2/10, Train Loss: 0.6631, Val Loss: 0.6566, Accuracy: 61.41%


Training Epoch [3/10]: 100%|██████████| 2500/2500 [01:18<00:00, 31.75it/s, loss=0.679]
Validation Epoch [3/10]: 100%|██████████| 625/625 [00:05<00:00, 116.54it/s, loss=0.693]


Epoch 3/10, Train Loss: 0.6507, Val Loss: 0.6526, Accuracy: 61.97%


Training Epoch [4/10]: 100%|██████████| 2500/2500 [01:18<00:00, 31.74it/s, loss=0.772]
Validation Epoch [4/10]: 100%|██████████| 625/625 [00:06<00:00, 102.00it/s, loss=0.689]


Epoch 4/10, Train Loss: 0.6426, Val Loss: 0.6451, Accuracy: 62.78%


Training Epoch [5/10]: 100%|██████████| 2500/2500 [01:20<00:00, 30.92it/s, loss=0.627]
Validation Epoch [5/10]: 100%|██████████| 625/625 [00:05<00:00, 114.52it/s, loss=0.704]


Epoch 5/10, Train Loss: 0.6343, Val Loss: 0.6430, Accuracy: 62.67%


Training Epoch [6/10]: 100%|██████████| 2500/2500 [01:19<00:00, 31.37it/s, loss=0.561]
Validation Epoch [6/10]: 100%|██████████| 625/625 [00:05<00:00, 111.63it/s, loss=0.718]


Epoch 6/10, Train Loss: 0.6280, Val Loss: 0.6384, Accuracy: 62.90%


Training Epoch [7/10]: 100%|██████████| 2500/2500 [01:17<00:00, 32.19it/s, loss=0.592]
Validation Epoch [7/10]: 100%|██████████| 625/625 [00:05<00:00, 116.08it/s, loss=0.725]


Epoch 7/10, Train Loss: 0.6218, Val Loss: 0.6366, Accuracy: 63.14%


Training Epoch [8/10]: 100%|██████████| 2500/2500 [01:17<00:00, 32.35it/s, loss=0.53] 
Validation Epoch [8/10]: 100%|██████████| 625/625 [00:05<00:00, 123.77it/s, loss=0.705]


Epoch 8/10, Train Loss: 0.6156, Val Loss: 0.6378, Accuracy: 63.27%


Training Epoch [9/10]: 100%|██████████| 2500/2500 [01:14<00:00, 33.41it/s, loss=0.784]
Validation Epoch [9/10]: 100%|██████████| 625/625 [00:05<00:00, 124.60it/s, loss=0.715]


Epoch 9/10, Train Loss: 0.6099, Val Loss: 0.6369, Accuracy: 63.94%


Training Epoch [10/10]: 100%|██████████| 2500/2500 [01:19<00:00, 31.58it/s, loss=0.521]
Validation Epoch [10/10]: 100%|██████████| 625/625 [00:05<00:00, 115.27it/s, loss=0.713]

Epoch 10/10, Train Loss: 0.6040, Val Loss: 0.6345, Accuracy: 63.98%





As we cna see model achieves around 64% accuracy which is a little bit better than random guessing.

In `sentiment_analysis_with_bert.ipynb` we will increase the accuracy using pretrained transformer