Import libraries needed

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from datasets import load_dataset
import numpy as np
from collections import Counter
import re

GloVe is chosen as the word embedding over Word2vec.

"So if your task involves understanding relationships between words in short phrases or individual sentences (like real-time chat analysis), Word2Vec might shine. GloVe, on the other hand, focuses on the global context. It learns from the overall co-occurrence of words across an entire text corpus."

Source: https://medium.com/biased-algorithms/word2vec-vs-glove-which-word-embedding-model-is-right-for-you-4dfc161c3f0c#:~:text=So%20if%20your%20task%20involves,across%20an%20entire%20text%20corpus.

The pretrained GloVe vectors can be downloaded from https://nlp.stanford.edu/projects/glove/ (this notebook uses glove.6B.300d.txt).

In [12]:
# Load the Rotten Tomatoes dataset
dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']

# Load GloVe embeddings
def load_glove_embeddings(file_path):
    embedding_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embedding_dict[word] = vector
    return embedding_dict

glove_embeddings = load_glove_embeddings("glove.6B/glove.6B.300d.txt")

In [13]:
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

In [14]:
# Tokenizer (also defined in the Question 1(a) cell below; repeated here so this cell runs on its own)
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# Tokenizing the dataset to gather sentence statistics
token_counts = Counter()
for sample in train_dataset:
    tokens = tokenize(sample['text'])
    token_counts.update(tokens)

# Total number of tokens in all sentences
total_tokens = sum(token_counts.values())

# Size of the vocabulary excluding special tokens (word indices start at 1)
vocab = {word: idx + 1 for idx, (word, _) in enumerate(token_counts.items())}
vocab_size_excluding_special = len(vocab)

# Size of the vocabulary including the padding index 0
# (no separate unknown-token index is reserved, so only +1 is added)
vocab_size_including_special = vocab_size_excluding_special + 1

# Print the results
print("Total Number of Tokens in All Sentences:", total_tokens)
print("Size of Vocabulary Including Padding and Unknown Tokens:", vocab_size_including_special)
print("Size of Vocabulary Excluding Special Tokens:", vocab_size_excluding_special)

Total Number of Tokens in All Sentences: 164854
Size of Vocabulary Including Padding and Unknown Tokens: 16513
Size of Vocabulary Excluding Special Tokens: 16512


In [15]:
validation_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})

In [16]:
test_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})

Question 1. Word Embedding
(a) What is the size of the vocabulary formed from your training data?

In [17]:
# Tokenize text
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# Build vocabulary from the training data
word_counter = Counter()
for sample in train_dataset:
    word_counter.update(tokenize(sample['text']))

# Create word-to-index mapping
vocab = {word: idx + 1 for idx, (word, _) in enumerate(word_counter.items())}
vocab_size = len(vocab) + 1  # +1 for padding index 0

print(f"Vocabulary Size: {vocab_size}")

Vocabulary Size: 16513
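
To make the effect of the regex tokenizer concrete, here is a small self-contained check. The sample sentence is made up for illustration and does not come from the dataset. It shows that contractions and hyphenated words are split at the punctuation, which influences both the vocabulary size and which words end up out of vocabulary.

```python
import re
from collections import Counter

def tokenize(text):
    # Same rule as the notebook: lowercase, then keep runs of word characters
    return re.findall(r'\b\w+\b', text.lower())

# Hypothetical sentence, used only to illustrate the splitting behaviour
sample = "It's a so-so film, but the acting isn't bad!"
tokens = tokenize(sample)
print(tokens)

# Build a toy vocabulary the same way as above: index 0 is reserved for padding
counts = Counter(tokens)
toy_vocab = {word: idx + 1 for idx, word in enumerate(counts)}
toy_vocab_size = len(toy_vocab) + 1  # +1 for padding index 0
print(toy_vocab_size)
```

Note that "isn't" becomes the two tokens "isn" and "t", so fragments like these are counted as separate vocabulary entries.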


Question 1. Word Embedding
(b) We use OOV (out-of-vocabulary) to refer to those words appeared in the training data but not in the Word2vec (or Glove) dictionary. How many OOV words exist in your training data?

In [18]:
# Initialize embedding matrix
embedding_dim = 300
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # Row 0 is for padding (all zeros)

# Compute the average of all known word embeddings for OOV initialization
known_embeddings = np.array(list(glove_embeddings.values()))
average_embedding = np.mean(known_embeddings, axis=0)

oov_words = []

# Fill the embedding matrix
for word, idx in vocab.items():
    if word in glove_embeddings:
        embedding_matrix[idx] = glove_embeddings[word]
    else:
        oov_words.append(word)
        embedding_matrix[idx] = average_embedding  # Assign the average of known embeddings for OOV words

print(f"Number of OOV words: {len(oov_words)}")

Number of OOV words: 591


In [20]:
embedding_matrix

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 4.65600006e-02,  2.13180006e-01, -7.43639981e-03, ...,
         9.06109996e-03, -2.09889993e-01,  5.39130010e-02],
       [-1.49240002e-01,  2.12440006e-02, -3.42400014e-01, ...,
         6.46799982e-01, -3.72390002e-01, -8.50550011e-02],
       ...,
       [ 2.82950014e-01,  5.94279990e-02,  1.21420003e-01, ...,
        -2.45979995e-01, -2.47429997e-01, -4.69060004e-01],
       [ 9.23821107e-02, -8.33334178e-02, -4.74294720e-05, ...,
         2.18727857e-01,  1.81843743e-01, -6.70229569e-02],
       [ 3.01910013e-01,  9.50440019e-02,  6.88820004e-01, ...,
         2.34620005e-01,  4.47530001e-02, -8.31780016e-01]])

In [21]:
known_embeddings

array([[ 0.04656  ,  0.21318  , -0.0074364, ...,  0.0090611, -0.20989  ,
         0.053913 ],
       [-0.25539  , -0.25723  ,  0.13169  , ..., -0.2329   , -0.12226  ,
         0.35499  ],
       [-0.12559  ,  0.01363  ,  0.10306  , ..., -0.34224  , -0.022394 ,
         0.13684  ],
       ...,
       [ 0.075713 , -0.040502 ,  0.18345  , ...,  0.21838  ,  0.30967  ,
         0.43761  ],
       [ 0.81451  , -0.36221  ,  0.31186  , ...,  0.075486 ,  0.28408  ,
        -0.17559  ],
       [ 0.429191 , -0.296897 ,  0.15011  , ...,  0.28975  ,  0.32618  ,
        -0.0590532]], dtype=float32)

In [22]:
average_embedding

array([ 9.23821107e-02, -8.33334178e-02, -4.74294720e-05,  1.36092737e-01,
       -1.11753214e-02, -8.99242051e-03,  8.04364085e-02, -1.01534374e-01,
       -4.50804494e-02,  6.10840023e-01, -1.13657795e-01,  3.47111840e-03,
        1.00918554e-01, -1.08997978e-01, -8.33619833e-02, -1.21095352e-01,
        8.74321386e-02, -2.68112402e-02,  2.45265719e-02,  5.65140955e-02,
        4.17313576e-02, -6.88741356e-02, -2.08641350e-01, -1.06938221e-01,
        1.64321020e-01, -1.77382231e-02, -1.67867485e-02,  2.83149779e-02,
        7.04017058e-02, -5.70689030e-02, -2.60384772e-02, -1.84562773e-01,
        9.58825573e-02, -1.21241361e-01,  4.57528085e-01, -3.04208528e-02,
       -7.29278773e-02, -1.26595302e-02,  6.19916096e-02,  3.61088440e-02,
        4.24099900e-02,  8.48450214e-02,  4.51800488e-02, -1.89534217e-01,
       -2.90697701e-02, -2.75477953e-02, -6.76741451e-02, -1.16799802e-01,
        8.04973617e-02, -7.29644075e-02, -1.84061080e-02,  8.43591914e-02,
        1.34552084e-02, -

In [24]:
vocab.items()



Question 1. Word Embedding
(c) The existence of the OOV words is one of the well-known limitations of Word2vec (or Glove). Without using any transformer-based language models (e.g., BERT, GPT, T5), what do you think is the best strategy to mitigate such limitation? Implement your solution in your source code. Show the corresponding code snippet.

=== Use the Average of All Known Word Embeddings for OOV Words ===

Instead of assigning random embeddings to OOV words, which adds arbitrary variance, every OOV word is assigned the average of all GloVe vectors (the mean over the full GloVe dictionary, computed as average_embedding above). This places OOV words at the centroid of the embedding space, a neutral position that does not push them toward any particular meaning. Because the embedding layer is frozen in the model below, this vector is used as-is and is not updated during training, so all OOV words share one fixed, generic representation.

=== Benefits of This Strategy ===

-> Reduced Noise: A single shared vector for all OOV words avoids the arbitrary differences that random vectors would introduce between them.

-> Stability in Learning: The average vector is computed from the known word vectors, so it lies within the region they occupy and is less disruptive to the RNN's inputs than a random vector.

-> Simplified Representation: All OOV words map to the same embedding, so the model cannot overfit to per-word random vectors. The trade-off is that it also cannot tell one OOV word from another.
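
The strategy can be sketched on toy data as follows. This is a minimal illustration, not the exact notebook cell: the 3-dimensional vectors, the word "schlocky", and the helper name build_embedding_matrix are assumptions made for the example, while the logic (row 0 stays zero for padding, OOV rows get the mean pretrained vector) follows the code used for Question 1(b).

```python
import numpy as np

# Toy stand-ins for the real objects (hypothetical values, 3-d for brevity)
toy_glove = {
    "good": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "bad": np.array([-1.0, 0.0, 0.0], dtype="float32"),
    "film": np.array([0.0, 3.0, 1.0], dtype="float32"),
}
toy_vocab = {"good": 1, "film": 2, "schlocky": 3}  # index 0 is padding

def build_embedding_matrix(vocab, pretrained, dim):
    """Return (matrix, oov_words); OOV rows get the mean pretrained vector."""
    matrix = np.zeros((len(vocab) + 1, dim), dtype="float32")  # row 0 stays zero
    mean_vector = np.mean(np.stack(list(pretrained.values())), axis=0)
    oov_words = []
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
        else:
            oov_words.append(word)
            matrix[idx] = mean_vector  # shared generic vector for every OOV word
    return matrix, oov_words

matrix, oov = build_embedding_matrix(toy_vocab, toy_glove, dim=3)
print(oov)
print(matrix[3])
```

In this toy case the only OOV word is "schlocky", and its row equals the element-wise mean of the three pretrained vectors.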

Question 2. RNN
(a) Report the final configuration of your best model, namely the number of training epochs, learning rate, optimizer, batch size.

Number of Training Epochs: Up to 20, with early stopping when the validation accuracy does not improve for 3 consecutive epochs. In the run shown below, training stopped after epoch 7.

Learning Rate: 0.001

Optimizer: Adam (torch.optim.Adam with default settings apart from the learning rate).

Batch Size: 64
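
The early-stopping rule in this configuration can be sketched in isolation. The EarlyStopper class below is a hypothetical helper, not part of the notebook; it also keeps a copy of the best state, which is a variant the notebook's training loop does not implement (there, the test set is evaluated with the weights from the last epoch). The validation accuracies replayed here are the ones printed in this notebook's training log.

```python
import copy

class EarlyStopper:
    """Tracks the best validation score and counts epochs without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_state = None
        self.best_epoch = None
        self.epochs_no_improve = 0

    def step(self, score, state, epoch):
        """Record one epoch; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_state = copy.deepcopy(state)  # snapshot of the best weights
            self.best_epoch = epoch
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience

# Replay the per-epoch validation accuracies reported in this notebook
val_accs = [0.6932, 0.7308, 0.7073, 0.7477, 0.7411, 0.7017, 0.7223]
stopper = EarlyStopper(patience=3)
for epoch, acc in enumerate(val_accs, start=1):
    # The dict is a placeholder; a real loop would pass model.state_dict()
    if stopper.step(acc, {"epoch": epoch}, epoch):
        break

print(stopper.best_epoch, stopper.best_score, epoch)
```

In a real training loop, one option would be to call model.load_state_dict(stopper.best_state) before evaluating on the test set, so that the reported test accuracy corresponds to the epoch with the best validation accuracy.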

In [19]:
# Define the Sentiment Dataset
class SentimentDataset(Dataset):
    def __init__(self, dataset, vocab):
        self.sentences = [tokenize_sentence(sample['text'], vocab) for sample in dataset]
        self.labels = [sample['label'] for sample in dataset]

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return torch.tensor(self.sentences[idx]), torch.tensor(self.labels[idx])

# Tokenizing the dataset and ensuring valid index range
def tokenize_sentence(sentence, vocab):
    tokens = tokenize(sentence)
    return [vocab.get(token, 0) for token in tokens]  # Tokens not in the training vocabulary map to index 0 (shared with padding)

# Creating datasets and dataloaders
batch_size = 64

train_data = SentimentDataset(train_dataset, vocab)
val_data = SentimentDataset(validation_dataset, vocab)
test_data = SentimentDataset(test_dataset, vocab)

# Padding function for data loader
def collate_fn(batch):
    sentences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sentences])
    sentences_padded = pad_sequence(sentences, batch_first=True, padding_value=0)
    labels = torch.tensor(labels)
    return sentences_padded, labels, lengths

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

# Define the RNN model with average pooling and batch normalization
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, embedding_matrix):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embedding.weight.data.copy_(torch.tensor(embedding_matrix, dtype=torch.float))
        self.embedding.weight.requires_grad = False  # Freeze embeddings

        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.5)  # Dropout layer to reduce overfitting
        self.batch_norm = nn.BatchNorm1d(hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, lengths):
        embedded = self.embedding(x)  # [batch_size, seq_len, embedding_dim]
        packed_embedded = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
        packed_rnn_out, _ = self.rnn(packed_embedded)
        rnn_out, _ = pad_packed_sequence(packed_rnn_out, batch_first=True)

        # Apply average pooling across the sequence length dimension
        batch_size, max_len, hidden_dim = rnn_out.size()
        mask = torch.arange(max_len).expand(batch_size, max_len) < lengths.unsqueeze(1)
        rnn_out = rnn_out * mask.unsqueeze(2)  # Mask out padding tokens
        sentence_representation = rnn_out.sum(dim=1) / lengths.unsqueeze(1).float()

        sentence_representation = self.batch_norm(sentence_representation)
        sentence_representation = self.dropout(sentence_representation)
        output = self.fc(sentence_representation)
        return output

# Model parameters
hidden_dim = 256
output_dim = 2  # Sentiment (positive or negative)

# Initialize the model
model = RNNModel(vocab_size, embedding_dim, hidden_dim, output_dim, embedding_matrix)

# Training parameters
epochs = 20
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Training loop with gradient clipping and early stopping
best_val_acc = 0
patience = 3  # Stop training if validation accuracy does not improve for 3 consecutive epochs
epochs_no_improve = 0

for epoch in range(epochs):
    # Training phase
    model.train()
    train_loss = 0.0
    correct = 0
    total = 0

    for sentences, labels, lengths in train_loader:
        optimizer.zero_grad()  # Zero out gradients
        output = model(sentences, lengths)

        # Calculate loss and backpropagate
        loss = criterion(output, labels)
        loss.backward()

        # Apply gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

        optimizer.step()

        train_loss += loss.item()

        # Calculate accuracy during training
        _, predicted = torch.max(output, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

    train_acc = correct / total

    # Validation phase
    model.eval()
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for sentences, labels, lengths in val_loader:
            output = model(sentences, lengths)
            _, predicted = torch.max(output, 1)
            total_val += labels.size(0)
            correct_val += (predicted == labels).sum().item()

    val_acc = correct_val / total_val

    # Output training and validation results for this epoch
    print(f"Epoch [{epoch+1}/{epochs}], Train Loss: {train_loss:.4f}, Train Accuracy: {train_acc:.4f}, Validation Accuracy: {val_acc:.4f}")

    # Early stopping if validation accuracy is not improving
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_no_improve = 0  # Reset the counter if the validation accuracy improves
    else:
        epochs_no_improve += 1

    if epochs_no_improve == patience:
        print(f"Early stopping after epoch {epoch+1}")
        break

# Evaluation on Test Set
model.eval()
correct_test = 0
total_test = 0
with torch.no_grad():
    for sentences, labels, lengths in test_loader:
        output = model(sentences, lengths)
        _, predicted = torch.max(output, 1)
        total_test += labels.size(0)
        correct_test += (predicted == labels).sum().item()

test_acc = correct_test / total_test
print(f"Test Accuracy: {test_acc:.4f}")

Epoch [1/20], Train Loss: 77.3435, Train Accuracy: 0.7103, Validation Accuracy: 0.6932
Epoch [2/20], Train Loss: 69.9883, Train Accuracy: 0.7436, Validation Accuracy: 0.7308
Epoch [3/20], Train Loss: 68.4679, Train Accuracy: 0.7510, Validation Accuracy: 0.7073
Epoch [4/20], Train Loss: 65.2276, Train Accuracy: 0.7655, Validation Accuracy: 0.7477
Epoch [5/20], Train Loss: 64.4758, Train Accuracy: 0.7683, Validation Accuracy: 0.7411
Epoch [6/20], Train Loss: 61.4413, Train Accuracy: 0.7777, Validation Accuracy: 0.7017
Epoch [7/20], Train Loss: 61.1289, Train Accuracy: 0.7826, Validation Accuracy: 0.7223
Early stopping after epoch 7
Test Accuracy: 0.7280


In [25]:
correct_test

776

In [26]:
total_test

1066

Question 2. RNN (b) Report the accuracy score on the test set, as well as the accuracy score on the validation set for each epoch during training

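Test accuracy: 0.7280 (776 of 1066 test sentences classified correctly). The validation accuracy for each epoch is listed in the training log above; the best validation accuracy was 0.7477 at epoch 4, and training stopped early after epoch 7.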

Question 2. RNN (c) RNNs produce a hidden vector for each word, instead of the entire sentence. Which methods have you tried in deriving the final sentence representation to perform sentiment classification? Describe all the strategies you have implemented, together with their accuracy scores on the test set.

=== Average Pooling over Hidden States ===

Description: After running the RNN on the embedded sentence, instead of just using the hidden state of the last word (which can cause information loss), average pooling was applied across all the hidden states. This means taking the mean of all the hidden states produced at each time step of the RNN, which gives a more comprehensive representation of the entire sequence.

Implementation

batch_size, max_len, hidden_dim = rnn_out.size()
mask = torch.arange(max_len).expand(batch_size, max_len) < lengths.unsqueeze(1)
rnn_out = rnn_out * mask.unsqueeze(2)  # Mask out padding tokens
sentence_representation = rnn_out.sum(dim=1) / lengths.unsqueeze(1).float()

Purpose: Average pooling lets every time step contribute to the sentence representation, which matters for long sequences where the last hidden state of a plain RNN may have lost information from early words. Test accuracy with this strategy: 0.7280.
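
For comparison, the sketch below puts masked average pooling next to two common alternatives: taking the hidden state at the last real token, and masked max pooling. Only average pooling was trained and evaluated in this notebook; the other two functions are illustrative assumptions with no accuracy scores behind them. The function names and the toy tensor are hypothetical, and everything is assumed to live on the CPU as in the rest of the notebook.

```python
import torch

def length_mask(rnn_out, lengths):
    # True for real time steps, False for padding; shape [batch, max_len]
    return torch.arange(rnn_out.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

def last_hidden(rnn_out, lengths):
    """Hidden state at the last non-padded time step of each sequence."""
    idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, rnn_out.size(2))
    return rnn_out.gather(1, idx).squeeze(1)

def masked_mean(rnn_out, lengths):
    """Average over real time steps only (the strategy used in this notebook)."""
    mask = length_mask(rnn_out, lengths)
    summed = (rnn_out * mask.unsqueeze(2)).sum(dim=1)
    return summed / lengths.unsqueeze(1).float()

def masked_max(rnn_out, lengths):
    """Element-wise max over real time steps; padding is set to -inf first."""
    mask = length_mask(rnn_out, lengths)
    filled = rnn_out.masked_fill(~mask.unsqueeze(2), float("-inf"))
    return filled.max(dim=1).values

# Toy RNN output: batch of 2, max length 3, hidden size 2 (second row is padded)
rnn_out = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[-7.0, -8.0], [0.0, 0.0], [0.0, 0.0]],
])
lengths = torch.tensor([3, 1])

print(last_hidden(rnn_out, lengths))
print(masked_mean(rnn_out, lengths))
print(masked_max(rnn_out, lengths))
```

The second toy sequence has negative values followed by zero padding, which is the case where an unmasked max or mean would be pulled toward the padding instead of the real token.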

=== Batch Normalization and Dropout ===

Batch Normalization was applied after pooling to normalize the pooled sentence vector across the batch, which is intended to make training faster and more stable.
Dropout with a rate of 0.5 was used to reduce overfitting by randomly zeroing each feature with probability 0.5 during training; it is disabled at evaluation time by model.eval().

Implementation

sentence_representation = self.batch_norm(sentence_representation)
sentence_representation = self.dropout(sentence_representation)
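
Both layers behave differently in training and evaluation mode, which is why the notebook calls model.train() and model.eval() around the respective loops. The sketch below illustrates this on a hypothetical stand-alone classifier head with made-up sizes (4 features, 2 classes, batch of 8); it is not the notebook's model, only the same layer order.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical head mirroring the notebook's order: BatchNorm -> Dropout -> Linear
head = nn.Sequential(
    nn.BatchNorm1d(4),
    nn.Dropout(0.5),
    nn.Linear(4, 2),
)
pooled = torch.randn(8, 4)  # stand-in for 8 pooled sentence vectors

# Training mode: dropout samples a new random mask on every call and
# batch norm normalizes with the statistics of the current batch
head.train()
train_out_1 = head(pooled)
train_out_2 = head(pooled)

# Evaluation mode: dropout is a no-op and batch norm uses its running statistics,
# so repeated calls on the same input give identical outputs
head.eval()
with torch.no_grad():
    eval_out_1 = head(pooled)
    eval_out_2 = head(pooled)

print(torch.equal(eval_out_1, eval_out_2))
print(torch.equal(train_out_1, train_out_2))
```

The first comparison checks that evaluation is deterministic; the second shows whether the two training-mode passes happened to coincide, which is very unlikely because the dropout masks are sampled independently.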

Average Pooling: Instead of using the final hidden state, the average of all hidden states was used to form the final sentence representation.

Batch Normalization: Used to stabilize learning and improve generalization.

Dropout: Applied to reduce overfitting and help the model generalize to unseen data.