# Tutorial 2 - embeddings

Authors: Michał Gromadzki, Kacper Skonieczka

TASK: sentiment analysis classification

This project performs sentiment analysis on the IMDb movie reviews dataset using a transformer-based model, DistilBERT. DistilBERT is a smaller, faster variant of BERT and is well suited to text classification tasks such as sentiment analysis. We preprocess the data, train a small classification head on top of the frozen DistilBERT encoder, and evaluate it with common classification metrics: accuracy, precision, recall, and F1 score.

In [None]:
import torch
import torch.nn as nn
import pandas as pd
import re
import nltk
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data preparation

We load the IMDb dataset from the Hugging Face Hub, which provides separate training and test splits, and preprocess the review text. Tokenization is done with `DistilBertTokenizer`, which converts text into the input IDs and attention masks the model expects.

In [None]:
splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
df_train = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])
df_test = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["test"])

In [None]:
# Build the stop-word set once instead of on every call
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))

def clean_text_pipeline(text):
    # Lowercase the text
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove special characters and punctuation
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text

In [None]:
df_train["text_cleaned"] = df_train["text"].apply(clean_text_pipeline)
df_test["text_cleaned"] = df_test["text"].apply(clean_text_pipeline)
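
To see what this cleaning does to a single review, the sketch below applies the same regular expressions to a made-up sentence. The stop-word set here is a small hand-picked stand-in (an assumption, so the example runs without downloading NLTK data); the real NLTK list is much longer.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english');
# hand-picked for this example only.
STOP_WORDS_DEMO = {"i", "did", "not", "this", "it", "was", "of", "the", "a"}

def clean_text_demo(text, stop_words=STOP_WORDS_DEMO):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)            # HTML tags
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)   # punctuation and special characters
    text = re.sub(r'\d+', '', text)              # digits
    return ' '.join(w for w in text.split() if w not in stop_words)

raw = "I did NOT like this movie.<br />It was 2 hours of boredom!"
cleaned = clean_text_demo(raw)
print(cleaned)  # like movieit hours boredom
```

In this example the negation disappears with the stop words, and because the `<br />` tag is replaced by an empty string, "movie." and "It" fuse into a single token.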

In [None]:
class MovieReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        assert len(texts) == len(labels), "Text and labels must be the same length"

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize the text with the tokenizer and convert to tensors
        encoded_input = self.tokenizer(
            text,
            padding='max_length',
            max_length=self.max_length,
            truncation=True,
            return_tensors='pt'
        )

        input_ids = encoded_input['input_ids'].squeeze(0)  # Squeeze to remove batch dimension
        attention_mask = encoded_input['attention_mask'].squeeze(0)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }, label
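
The effect of `padding='max_length'` and `truncation=True` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: it ignores special tokens, and the token IDs and `pad_id=0` are placeholder values chosen for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = token_ids[:max_length]          # truncate sequences that are too long
    n_pad = max_length - len(ids)         # number of padding positions needed
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Because every item has the same length, the default `DataLoader` collation can stack them into a batch tensor without a custom collate function.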

In [None]:
from transformers import DistilBertTokenizer, DistilBertModel
distilbert_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

for param in distilbert_model.parameters():
    param.requires_grad = False

In [None]:
# DistilBERT accepts at most 512 tokens per sequence, so longer reviews are truncated
max_length = 512
batch_size = 128

# Create dataset
dataset_train = MovieReviewDataset(df_train["text_cleaned"].tolist(), df_train["label"].tolist(), tokenizer, max_length)
dataset_test = MovieReviewDataset(df_test["text_cleaned"].tolist(), df_test["label"].tolist(), tokenizer, max_length)

# Create DataLoader
dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
dataloader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=False)
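
As a quick check on the loader sizes, the arithmetic below assumes each split holds 25,000 reviews (the size of the IMDb train and test sets) and that the `DataLoader` keeps the final, smaller batch, which is its default behaviour.

```python
import math

n_reviews = 25_000   # assumed size of each IMDb split
batch_size = 128

n_batches = math.ceil(n_reviews / batch_size)
last_batch_size = n_reviews - (n_batches - 1) * batch_size
print(n_batches, last_batch_size)  # 196 40
```

The figure of 196 batches matches the `196/196` shown by the progress bars in the training output.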

## Model Definition: DistilBERT

Here we define a classifier on top of the DistilBERT model loaded above with the Hugging Face transformers library. The DistilBERT parameters are frozen so they are not updated during training; only the classification head is trained.
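
A minimal sketch of what freezing means in PyTorch, using a tiny `nn.Linear` as a stand-in for the pretrained encoder (an assumption made so the example runs without downloading any weights). Parameters with `requires_grad = False` receive no gradient, so the optimizer leaves them unchanged.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained encoder
backbone = nn.Linear(8, 4)
for param in backbone.parameters():
    param.requires_grad = False  # freeze

# Small trainable head, analogous in shape to pre_classifier + classifier
head = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
model_demo = nn.Sequential(backbone, head)

trainable = sum(p.numel() for p in model_demo.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model_demo.parameters() if not p.requires_grad)
print(trainable, frozen)  # 23 36

# After a backward pass only the head has gradients
model_demo(torch.randn(5, 8)).sum().backward()
print(backbone.weight.grad is None)      # True
print(head[0].weight.grad is not None)   # True
```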


In [None]:
# Simple classification model built on top of DistilBERT as feature extractor
class DistilBERTClassifier(nn.Module):
    def __init__(self, distilbert_model, num_labels):
        super(DistilBERTClassifier, self).__init__()
        self.distilbert = distilbert_model
        self.pre_classifier = nn.Linear(self.distilbert.config.hidden_size, 32)
        self.classifier = nn.Linear(32, num_labels)

    def forward(self, input_ids, attention_mask):
        # Extract features from DistilBERT
        outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
        x = hidden_state[:, 0]  # We take the first token's output (CLS token)
        x = self.pre_classifier(x)
        x = nn.ReLU()(x)
        logits = self.classifier(x)
        return logits

num_labels = 2
model = DistilBERTClassifier(distilbert_model, num_labels)

In [None]:
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

In [None]:
EPOCHS = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Model Training
### Training on preprocessed text
- Train the classification head on top of the frozen DistilBERT for 10 epochs.
- Evaluate on the test set after every epoch and log the loss and performance metrics.
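
The metrics logged after each epoch follow the usual binary-classification definitions, with label 1 treated as the positive class. The sketch below recomputes them from scratch on a made-up set of labels and predictions; the tutorial itself uses the scikit-learn functions imported at the top.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Made-up example data
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, round(f1, 4))  # 0.625 0.6 0.75 0.6667
```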

In [None]:
model = model.to(device)

for epoch in range(EPOCHS):
    model.train()
    loss_total = 0
    cnt = 0
    for batch in tqdm(dataloader_train):
        x, y = batch
        input_ids = x['input_ids'].to(device)
        attention_mask = x['attention_mask'].to(device)
        labels = y.to(device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_total += loss.item()
        cnt += 1

    model.eval()
    preds = []
    for batch in tqdm(dataloader_test):
        x, y = batch
        input_ids = x['input_ids'].to(device)
        attention_mask = x['attention_mask'].to(device)
        labels = y.to(device)

        # Forward pass
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, predicted = torch.max(outputs, 1)

        preds.extend(predicted.cpu().numpy())

    acc = accuracy_score(df_test['label'], preds)
    f1 = f1_score(df_test['label'], preds)
    precision = precision_score(df_test['label'], preds)
    recall = recall_score(df_test['label'], preds)

    print(f"Epoch {epoch + 1}/{EPOCHS}, Loss: {loss_total / cnt}, Accuracy: {acc}, F1: {f1}, Precision: {precision}, Recall: {recall}")

100%|██████████| 196/196 [06:17<00:00,  1.92s/it]
100%|██████████| 196/196 [06:05<00:00,  1.86s/it]


Epoch 1/10, Loss: 0.641500647883026, Accuracy: 0.75356, F1: 0.757316736912593, Precision: 0.7459455264995732, Recall: 0.76904


100%|██████████| 196/196 [06:18<00:00,  1.93s/it]
100%|██████████| 196/196 [06:05<00:00,  1.87s/it]


Epoch 2/10, Loss: 0.5465788521936962, Accuracy: 0.77864, F1: 0.7796799108209251, Precision: 0.776034236804565, Recall: 0.78336


100%|██████████| 196/196 [06:18<00:00,  1.93s/it]
100%|██████████| 196/196 [06:05<00:00,  1.86s/it]


Epoch 3/10, Loss: 0.4954750443599662, Accuracy: 0.7878, F1: 0.7804130965685667, Precision: 0.8085599107985247, Recall: 0.75416


100%|██████████| 196/196 [06:18<00:00,  1.93s/it]
100%|██████████| 196/196 [06:06<00:00,  1.87s/it]


Epoch 4/10, Loss: 0.47062053865924175, Accuracy: 0.79424, F1: 0.7930313028084011, Precision: 0.7977173385138416, Recall: 0.7884


100%|██████████| 196/196 [06:19<00:00,  1.93s/it]
100%|██████████| 196/196 [06:05<00:00,  1.87s/it]


Epoch 5/10, Loss: 0.45542559423008744, Accuracy: 0.79744, F1: 0.8009903324687573, Precision: 0.7871929553530048, Recall: 0.81528


100%|██████████| 196/196 [06:19<00:00,  1.94s/it]
100%|██████████| 196/196 [06:05<00:00,  1.87s/it]


Epoch 6/10, Loss: 0.44665515468436845, Accuracy: 0.80316, F1: 0.800535041141421, Precision: 0.8113548599129077, Recall: 0.79


100%|██████████| 196/196 [06:17<00:00,  1.93s/it]
100%|██████████| 196/196 [06:04<00:00,  1.86s/it]


Epoch 7/10, Loss: 0.441218692277159, Accuracy: 0.80416, F1: 0.7984521653219167, Precision: 0.8224219810040706, Recall: 0.77584


100%|██████████| 196/196 [06:18<00:00,  1.93s/it]
100%|██████████| 196/196 [06:06<00:00,  1.87s/it]


Epoch 8/10, Loss: 0.43691897377067684, Accuracy: 0.8048, F1: 0.7958329846874739, Precision: 0.8341519031748816, Recall: 0.76088


100%|██████████| 196/196 [06:18<00:00,  1.93s/it]
100%|██████████| 196/196 [06:06<00:00,  1.87s/it]


Epoch 9/10, Loss: 0.4336636109011514, Accuracy: 0.8084, F1: 0.8108961705487564, Precision: 0.8004676539360873, Recall: 0.8216


100%|██████████| 196/196 [06:19<00:00,  1.94s/it]
100%|██████████| 196/196 [06:04<00:00,  1.86s/it]

Epoch 10/10, Loss: 0.4310526201615528, Accuracy: 0.80936, F1: 0.8055328872204995, Precision: 0.8220353097934711, Recall: 0.78968





### Training on the raw text, without preprocessing

In [None]:
# Create dataset
dataset_train = MovieReviewDataset(df_train["text"].tolist(), df_train["label"].tolist(), tokenizer, max_length)
dataset_test = MovieReviewDataset(df_test["text"].tolist(), df_test["label"].tolist(), tokenizer, max_length)

# Create DataLoader
dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
dataloader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=False)

In [None]:
num_labels = 2
model = DistilBERTClassifier(distilbert_model, num_labels)

In [None]:
optimizer = AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

In [None]:
EPOCHS = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
model = model.to(device)

for epoch in range(EPOCHS):
    model.train()
    loss_total = 0
    cnt = 0
    for batch in tqdm(dataloader_train):
        x, y = batch
        input_ids = x['input_ids'].to(device)
        attention_mask = x['attention_mask'].to(device)
        labels = y.to(device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_total += loss.item()
        cnt += 1

    model.eval()
    preds = []
    for batch in tqdm(dataloader_test):
        x, y = batch
        input_ids = x['input_ids'].to(device)
        attention_mask = x['attention_mask'].to(device)
        labels = y.to(device)

        # Forward pass
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, predicted = torch.max(outputs, 1)

        preds.extend(predicted.cpu().numpy())

    acc = accuracy_score(df_test['label'], preds)
    f1 = f1_score(df_test['label'], preds)
    precision = precision_score(df_test['label'], preds)
    recall = recall_score(df_test['label'], preds)

    print(f"Epoch {epoch + 1}/{EPOCHS}, Loss: {loss_total / cnt}, Accuracy: {acc}, F1: {f1}, Precision: {precision}, Recall: {recall}")

100%|██████████| 196/196 [07:39<00:00,  2.34s/it]
100%|██████████| 196/196 [07:24<00:00,  2.27s/it]


Epoch 1/10, Loss: 0.6359314924600173, Accuracy: 0.80256, F1: 0.805347424875779, Precision: 0.7941359464924561, Recall: 0.81688


100%|██████████| 196/196 [07:38<00:00,  2.34s/it]
100%|██████████| 196/196 [07:23<00:00,  2.26s/it]


Epoch 2/10, Loss: 0.5132269103612218, Accuracy: 0.81948, F1: 0.8170133398207842, Precision: 0.8283318260297624, Recall: 0.806


100%|██████████| 196/196 [07:38<00:00,  2.34s/it]
100%|██████████| 196/196 [07:24<00:00,  2.27s/it]


Epoch 3/10, Loss: 0.4375992350432338, Accuracy: 0.83184, F1: 0.8326166587036152, Precision: 0.8287888395688016, Recall: 0.83648


100%|██████████| 196/196 [07:39<00:00,  2.34s/it]
100%|██████████| 196/196 [07:23<00:00,  2.26s/it]


Epoch 4/10, Loss: 0.40034028857338183, Accuracy: 0.83936, F1: 0.8383382980436357, Precision: 0.8437044239183277, Recall: 0.83304


100%|██████████| 196/196 [07:38<00:00,  2.34s/it]
100%|██████████| 196/196 [07:24<00:00,  2.27s/it]


Epoch 5/10, Loss: 0.3825528227857181, Accuracy: 0.84336, F1: 0.8468637572344752, Precision: 0.8283353733170135, Recall: 0.86624


100%|██████████| 196/196 [07:38<00:00,  2.34s/it]
100%|██████████| 196/196 [07:23<00:00,  2.26s/it]


Epoch 6/10, Loss: 0.37013255181361215, Accuracy: 0.8476, F1: 0.8506585136406396, Precision: 0.8339225330464187, Recall: 0.86808


100%|██████████| 196/196 [07:37<00:00,  2.34s/it]
100%|██████████| 196/196 [07:22<00:00,  2.26s/it]


Epoch 7/10, Loss: 0.3625328825140486, Accuracy: 0.85064, F1: 0.8533731249509149, Precision: 0.8380379453956501, Recall: 0.86928


100%|██████████| 196/196 [07:38<00:00,  2.34s/it]
100%|██████████| 196/196 [07:24<00:00,  2.27s/it]


Epoch 8/10, Loss: 0.3572563136718711, Accuracy: 0.8552, F1: 0.8551304626220586, Precision: 0.8555413196668802, Recall: 0.85472


100%|██████████| 196/196 [07:40<00:00,  2.35s/it]
100%|██████████| 196/196 [07:24<00:00,  2.27s/it]


Epoch 9/10, Loss: 0.35155866583999323, Accuracy: 0.85748, F1: 0.8556496374022606, Precision: 0.8667815808914061, Recall: 0.8448


100%|██████████| 196/196 [07:38<00:00,  2.34s/it]
100%|██████████| 196/196 [07:23<00:00,  2.26s/it]

Epoch 10/10, Loss: 0.348458391671278, Accuracy: 0.85648, F1: 0.8584503708379361, Precision: 0.8468244084682441, Recall: 0.8704





## Summary

We use DistilBERT, a smaller version of BERT, to classify movie reviews. Only the classification head is trained while the DistilBERT parameters are frozen.

Data-Processing Pipeline:
- Tokenization: Text is tokenized using DistilBertTokenizer into input IDs and attention masks.
- DataLoader: Data is batched using DataLoaders for efficient training.
- Training: Only the classification layer is trained with cross-entropy loss and AdamW optimizer.
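
To give a sense of how small the trained part is, the arithmetic below counts the parameters of the two linear layers in the head (`pre_classifier` and `classifier`). It assumes a hidden size of 768 for `distilbert-base-uncased` and uses roughly 66 million as an approximate size for the frozen encoder.

```python
hidden_size = 768            # assumed hidden size of distilbert-base-uncased
pre_classifier_out = 32
num_labels = 2

# A linear layer has in_features * out_features weights plus out_features biases
pre_classifier_params = hidden_size * pre_classifier_out + pre_classifier_out
classifier_params = pre_classifier_out * num_labels + num_labels
head_params = pre_classifier_params + classifier_params
print(pre_classifier_params, classifier_params, head_params)  # 24608 66 24674

approx_encoder_params = 66_000_000  # approximate, for scale only
print(f"trained share: {head_params / approx_encoder_params:.4%}")
```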

### Metrics on test set

The table reports the metrics logged after epoch 9 of each run.

| Input text                | Accuracy | F1 Score | Precision | Recall |
|---------------------------|----------|----------|-----------|--------|
| **With preprocessing**    | 0.8084   | 0.8109   | 0.8005    | 0.8216 |
| **Without preprocessing** | 0.8575   | 0.8556   | 0.8668    | 0.8448 |
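
As a consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from those two columns. The values below are copied from the table; small deviations are expected because the entries are rounded to four decimals.

```python
rows = {
    "with preprocessing": {"f1": 0.8109, "precision": 0.8005, "recall": 0.8216},
    "without preprocessing": {"f1": 0.8556, "precision": 0.8668, "recall": 0.8448},
}

recomputed = {}
for name, row in rows.items():
    p, r = row["precision"], row["recall"]
    recomputed[name] = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    print(f"{name}: reported F1 {row['f1']:.4f}, recomputed {recomputed[name]:.4f}")

accuracy_gap = 0.8575 - 0.8084
print(f"accuracy gap: {accuracy_gap:.4f}")  # accuracy gap: 0.0491
```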


Surprisingly, the model performed worse on the preprocessed text than on the raw text (accuracy 0.8084 versus 0.8575). We believe this is because DistilBERT was pretrained on complete sentences that included numbers and other elements that our cleaning removes, such as punctuation and stop words. As a result, preprocessing may have moved the input too far from the kind of text the model learned to represent.