# Fine-Tuning RoBERTa for Sentiment Analysis

&nbsp;

## 1. Introduction
This is a university project in which I fine-tune [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta), a pretrained transformer model, to classify the provided text examples as conveying either positive or negative emotions.

&nbsp;

### 1.1. Objectives
Achieve an F$_1$-score > 0.77 on the test predictions.

&nbsp;

### 1.2. Source of the Data
Datasets provided by the University.

&nbsp;

### 1.3. Datasets Description
The folder `data_roberta/train/` contains 12100 text files, which represent the training examples. The `data_roberta/train/labels.csv` file contains the respective training labels.

The folder `data_roberta/test/` contains 2000 text files that make up the test set. The test set labels were not provided to me; the final test F$_1$-score was evaluated on an online university submission platform.

No more information on the data was given.

&nbsp;

## 2. Importing Libraries and Loading Data

In [None]:
%%capture
# NumPy had to be pinned below 2.0 due to incompatibilities with the Datasets library from Hugging Face
!pip install -U datasets
# The version specifier is quoted so the shell does not treat "<" as an input redirection
!pip install "numpy<2.0"

# Unzipping data folder
!unzip data_roberta.zip

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification, get_linear_schedule_with_warmup
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
import os

In [None]:
np.__version__

'1.26.4'

In [None]:
# Load labeled list of training files:
train_files = pd.read_csv('data_roberta/train/labels.csv', index_col=0)
train_files['file'] = ['data_roberta/train/' + s for s in train_files['file']]
print(f'# of positive samples: {(train_files.label == 0).sum():d}')
print(f'# of negative samples: {(train_files.label == 1).sum():d}')
train_files.head()

# of positive samples: 8460
# of negative samples: 3640


| | file | label |
|---|---|---|
| 0 | data_roberta/train/0000.txt | 0 |
| 1 | data_roberta/train/0001.txt | 0 |
| 2 | data_roberta/train/0002.txt | 1 |
| 3 | data_roberta/train/0003.txt | 0 |
| 4 | data_roberta/train/0004.txt | 1 |


The training set is clearly imbalanced: 8460 examples have label 0 and only 3640 have label 1.

To address this, I later define a class-weighted loss function that penalizes wrong predictions on the minority class more heavily.

In [None]:
# Load random training data example:
sample_row = train_files.sample(1, random_state=0).iloc[0]

with open(sample_row.file, 'r') as f:
    print(f'Random training ex.: {f.read()}')
print(f'Corresponding label: {sample_row.label}')

Random training ex.: @user thank you for signing Shelby’s poster last night! She is still so excited. We are so proud of our big win! #OneTROY #GoTrojans #ProudAlum 
Corresponding label: 0


In [None]:
# Load list of test files:
test_files = ['data_roberta/test/' + s for s in os.listdir('data_roberta/test/')]
test_files.sort()
test_files = pd.DataFrame({'file': test_files})
print(f'# of testing examples: {len(test_files)}')
test_files.head()

# of testing examples: 2000


| | file |
|---|---|
| 0 | data_roberta/test/0000.txt |
| 1 | data_roberta/test/0001.txt |
| 2 | data_roberta/test/0002.txt |
| 3 | data_roberta/test/0003.txt |
| 4 | data_roberta/test/0004.txt |


&nbsp;

## 3. Creating Validation Dataset

In [None]:
# Creating a small stratified validation files dataset
train_files, val_files = train_test_split(
    train_files,
    test_size=0.05,
    stratify=train_files['label'],
    random_state=7
)
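
To illustrate what `stratify` does in the split above, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the function `stratified_split` and its rounding rule are assumptions made for illustration. The toy labels only reuse the class counts printed earlier (8460 with label 0, 3640 with label 1).

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, val_idx) so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group example indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (hypothetical rounding rule)
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels with the same class counts as the labelled training set
labels = [0] * 8460 + [1] * 3640
train_idx, val_idx = stratified_split(labels, test_fraction=0.05)

full_share = sum(labels) / len(labels)
val_share = sum(labels[i] for i in val_idx) / len(val_idx)
print(f"label-1 share: full set = {full_share:.4f}, validation = {val_share:.4f}")
```

Without stratification, a 5% validation set could by chance contain noticeably more or fewer minority-class examples, which would make the validation F$_1$-score less representative.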

&nbsp;

## 4. Loading RoBERTa and its Tokenizer

In [None]:
# Setting device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
# Loading pretrained RoBERTa base with Classification head
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model = model.to(device)
model.eval()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

&nbsp;

## 5. Defining the Loss Function

In [None]:
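# Defining custom loss function to penalize wrong minority class predictions

# Class weights according to formula w_j = n_samples / (n_classes * n_samples_j)
# Label 0 is the majority class and label 1 the minority class (see the counts printed earlier),
# so the variables are named after the label rather than the sentiment to avoid confusion
weight_label0 = len(train_files) / (2 * (train_files['label'] == 0).sum())
weight_label1 = len(train_files) / (2 * (train_files['label'] == 1).sum())
weights = torch.tensor([weight_label0, weight_label1], dtype=torch.float32, device=device)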

loss_fn = CrossEntropyLoss(weight=weights)
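
As a worked example of the weight formula $w_j = n_{samples} / (n_{classes} \cdot n_{samples_j})$, the sketch below evaluates it in plain Python and shows how the weights enter a weighted cross-entropy. It uses the class counts of the full labelled set (the cell above uses the counts after the validation split, so its values differ slightly). The helper names and the logits are hypothetical, and the weighted mean is written to mirror the normalisation by the sum of the applied weights that `CrossEntropyLoss` uses with its default `reduction='mean'`.

```python
import math

def balanced_class_weights(class_counts):
    """w_j = n_samples / (n_classes * n_samples_j) for each class j."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * n_j) for n_j in class_counts]

def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-example cross-entropy, normalised by the sum of the weights used."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        log_norm = math.log(sum(math.exp(z) for z in row))  # log-sum-exp over the classes
        nll = log_norm - row[target]                        # -log softmax(row)[target]
        total += weights[target] * nll
        weight_sum += weights[target]
    return total / weight_sum

# Counts for label 0 and label 1 in the full labelled set
weights = balanced_class_weights([8460, 3640])
print([round(w, 3) for w in weights])  # [0.715, 1.662]

# Hypothetical logits: one confidently wrong prediction per class
logits = [[-1.0, 1.0], [1.0, -1.0]]
targets = [0, 1]
loss = weighted_cross_entropy(logits, targets, weights)
print(f"weighted loss = {loss:.4f}")
```

With these counts, an error on a label-1 example contributes about 2.3 times as much to the loss as an equally wrong prediction on a label-0 example.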

&nbsp;

## 6. Preparing the Data

In [None]:
# Tokenizing and preparing the datasets for the DataLoaders
def tokenize_file(example):
    with open(example['file'], 'r') as f:
        text = f.read()
    return tokenizer(text, padding='max_length', truncation=True)

train_data = Dataset.from_pandas(train_files)
val_data = Dataset.from_pandas(val_files)
test_data = Dataset.from_pandas(test_files)

train_data = train_data.map(tokenize_file)
val_data = val_data.map(tokenize_file)
test_data = test_data.map(tokenize_file)

train_data = train_data.remove_columns(['file'])
val_data = val_data.remove_columns(['file'])
test_data = test_data.remove_columns(['file'])

train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format('torch', columns=['input_ids', 'attention_mask'])
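
To make the roles of `input_ids` and `attention_mask` concrete, here is a small pure-Python sketch of fixed-length padding. It is only an illustration, not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper and the token ids in the example are made up, except for the special ids `0` (`<s>`), `1` (`<pad>`) and `2` (`</s>`) used by RoBERTa.

```python
PAD_ID = 1  # matches padding_idx=1 in the embedding layer printed above

def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad token_ids to max_length and build the matching attention mask."""
    # Simplification: plain slicing; the real tokenizer keeps the closing special token when truncating
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# Hypothetical token ids for a short text
example = pad_and_mask([0, 1234, 567, 89, 2], max_length=8)
print(example["input_ids"])
print(example["attention_mask"])
```

With `padding='max_length'` and no explicit `max_length`, every example is padded to the tokenizer's maximum length, so short texts consist mostly of padding. Passing a smaller `max_length` or padding per batch would likely reduce computation, at the cost of a slightly more involved data pipeline.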

In [None]:
# Creating DataLoaders
train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True, num_workers=2)
val_dataloader = DataLoader(val_data, batch_size=8, shuffle=False, num_workers=2)
test_dataloader = DataLoader(test_data, batch_size=8, shuffle=False, num_workers=2)

&nbsp;

## 7. Defining Model Hyperparameters

In [None]:
# Number of epochs
nr_epochs = 3

# Optimizer
lr = 3e-5
optimizer = AdamW(model.parameters(), lr=lr)

# Scheduler
nr_train_steps = len(train_dataloader) * nr_epochs
nr_warmup_steps = int(0.1 * nr_train_steps)  # 10% warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=nr_warmup_steps,
    num_training_steps=nr_train_steps
)
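
The shape of this schedule can be sketched in plain Python. The function below is a re-implementation written for illustration (the name `linear_warmup_decay` and the step counts are hypothetical): the learning-rate multiplier rises linearly from 0 to 1 during warmup and then decays linearly to 0 at the last planned step.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 3e-5
total_steps, warmup_steps = 100, 10  # hypothetical sizes, with 10% warmup as above

for step in (0, 5, 10, 55, 100, 120):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

One consequence is that any step taken past `total_steps` uses a learning rate of zero, so training beyond the planned number of epochs only has an effect if the scheduler is rebuilt with a larger `num_training_steps`.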

&nbsp;

## 8. Training and Evaluation

In [None]:
# Training
loss_train = []

for epoch in range(nr_epochs):

  model.train()
  loss_train.append([])
  for batch in train_dataloader:
    optimizer.zero_grad()

    outputs = model(
      input_ids      = batch['input_ids'].to(device),
      attention_mask = batch['attention_mask'].to(device)
    )

    loss = loss_fn(outputs.logits, batch['label'].to(device))
    loss.backward()
    loss_train[-1].append(loss.item())

    optimizer.step()
    scheduler.step()

  loss_train[-1] = np.mean(loss_train[-1])

  print(f"Epoch {epoch+1}: loss = {loss_train[-1]:.3f}")

Epoch 1: loss = 0.565
Epoch 2: loss = 0.446
Epoch 3: loss = 0.362


In [None]:
# Evaluating
model.eval()
y_true, y_pred = [], []

for batch in val_dataloader:
  with torch.no_grad():
    outputs = model(
      input_ids      = batch['input_ids'].to(device),
      attention_mask = batch['attention_mask'].to(device)
    )

  y_pred.extend(outputs.logits.argmax(dim=1).cpu().numpy())
  y_true.extend(batch['label'].cpu().numpy())

f1_score(y_true, y_pred, average='macro')

0.7506936622620257
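
For reference, the macro-averaged F$_1$-score used above is the unweighted mean of the per-class F$_1$-scores, so the minority class counts as much as the majority class. The following pure-Python sketch computes it on hypothetical toy labels; `f1_for_class` and `macro_f1` are illustrative helpers, not scikit-learn functions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 = 2*TP / (2*TP + FP + FN), treating cls as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced toy labels: 7 examples of class 0, 3 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

print(f"F1 class 0: {f1_for_class(y_true, y_pred, 0):.2f}")  # 0.80
print(f"F1 class 1: {f1_for_class(y_true, y_pred, 1):.2f}")  # 0.40
print(f"macro F1:   {macro_f1(y_true, y_pred):.2f}")         # 0.60
```

In this toy case the accuracy is 0.70, while the macro F$_1$-score is only 0.60 because the poor performance on the minority class pulls the average down.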

In [None]:
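# Training for 1 more epoch
# Note: the scheduler was built with num_training_steps covering nr_epochs = 3, so the linear
# schedule has already decayed the learning rate to 0 at this point. These extra steps should
# therefore leave the weights essentially unchanged unless the optimizer and scheduler are
# re-created (e.g. with a larger num_training_steps) before this loop.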
loss_train = []

for epoch in range(1):

  model.train()
  loss_train.append([])
  for batch in train_dataloader:
    optimizer.zero_grad()

    outputs = model(
      input_ids      = batch['input_ids'].to(device),
      attention_mask = batch['attention_mask'].to(device)
    )

    loss = loss_fn(outputs.logits, batch['label'].to(device))
    loss.backward()
    loss_train[-1].append(loss.item())

    optimizer.step()
    scheduler.step()

  loss_train[-1] = np.mean(loss_train[-1])

  print(f"Epoch {epoch+4}: loss = {loss_train[-1]:.3f}")

Epoch 4: loss = 0.323


In [None]:
# Evaluating after the 1 extra training epoch
model.eval()
y_true, y_pred = [], []

for batch in val_dataloader:
  with torch.no_grad():
    outputs = model(
      input_ids      = batch['input_ids'].to(device),
      attention_mask = batch['attention_mask'].to(device)
    )

  y_pred.extend(outputs.logits.argmax(dim=1).cpu().numpy())
  y_true.extend(batch['label'].cpu().numpy())

f1_score(y_true, y_pred, average='macro')

0.7506936622620257

&nbsp;

## 9. Predictions

In [None]:
# Predicting
model.eval()
y_pred = []

for batch in test_dataloader:
  with torch.no_grad():
    outputs = model(
      input_ids      = batch['input_ids'].to(device),
      attention_mask = batch['attention_mask'].to(device)
    )

  y_pred.extend(outputs.logits.argmax(dim=1).cpu().numpy())

In [None]:
predictions = y_pred

pd.DataFrame(predictions, columns=['predictions']).to_csv('submission_roberta.csv')

&nbsp;

## 10. Conclusion

When fine-tuning a pretrained model, the number of epochs has to be chosen carefully: training for too many epochs can overfit the comparatively small task dataset and can degrade what the model learned during pre-training.

In this project I only trained the model for a single extra epoch (the 4th one) to see whether the F$_1$-score on the validation set would improve. It did not: the validation score was exactly the same (0.7507) before and after the extra epoch. This may be explained by the linear scheduler, which was configured for 3 epochs and had therefore already decayed the learning rate to zero before the 4th epoch started. Even so, the model achieved an F$_1$-score > 0.77 on the test dataset only after this extra training epoch: I checked this by submitting the predictions on the online university submission platform, and only the predictions from the 4-epoch training passed the test.