## CZ4042: Neural Networks and Deep Learning Group Project
Chua Yong Xuan, Daniel Yang, Jiang Zixing

## C. Sentiment Analysis in Text

## Sentiment Analysis with BERT

In this notebook, we aim to perform sentiment analysis on text data using the BERT (Bidirectional Encoder Representations from Transformers) model, a state-of-the-art transformer-based architecture that has shown remarkable performance in various natural language processing tasks.

We have already preprocessed our data in the data preprocessing notebook and split it into three datasets: training, validation, and test. Our focus here will be on fine-tuning a BERT model on our training dataset and evaluating its performance on the validation and test sets.

One of the key aspects we want to explore is the impact of hyperparameter tuning on the performance of our model. For this purpose, we will perform hyperparameter tuning based on the range of values recommended by the authors of the original BERT paper, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."

In addition to our original training dataset, we have created augmented training datasets using three tools: Google Translate, the `nlpaug` library, and GPT-2. We want to see how these augmented datasets affect the training and performance of our model. Will they improve the model's ability to generalize, or will they hurt performance because of data quality issues? Let's find out!

We install the dependencies and import the necessary libraries.

In [None]:
!pip install transformers
!pip install optuna

import random
import numpy as np
import torch
import pandas as pd
import optuna
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch optimizer
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score



To make our results reproducible, we seed the random number generators of every library used in this notebook, so that they produce the same sequence of values each time the notebook is run.

In [None]:
# Seed Python random
random_seed = 42
random.seed(random_seed)

# Seed NumPy random
np.random.seed(random_seed)

# Seed PyTorch random
torch.manual_seed(random_seed)

# Seed the CUDA random number generator
torch.cuda.manual_seed(random_seed)
torch.cuda.manual_seed_all(random_seed)

# Set the cuDNN backend to be deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
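
As a quick illustration of what seeding buys us, the sketch below uses only Python's built-in `random` module: two generators created with the same seed produce the same sequence. The helper name `draw_numbers` is ours and exists only for this example. Note that seeding narrows but does not fully remove run-to-run variation, since some GPU operations can remain nondeterministic.

```python
import random


def draw_numbers(seed, count=5):
    """Return `count` pseudo-random integers drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, leaves the global state untouched
    return [rng.randint(0, 99) for _ in range(count)]


first_run = draw_numbers(42)
second_run = draw_numbers(42)

# The same seed reproduces the same sequence of draws.
print(first_run == second_run)  # True
```

The same idea applies to `np.random.seed` and `torch.manual_seed` in the cell above: each call fixes the starting state of that library's generator.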

We use the GPU if one is available in our runtime; otherwise we fall back to the CPU.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

Importing the train, validation, and test sets created in our data preprocessing notebook.

In [None]:
# Importing our train, validation, and test set
train_df = pd.read_csv('train.csv', encoding='ISO-8859-1')
val_df = pd.read_csv('validation.csv', encoding='ISO-8859-1')
test_df = pd.read_csv('test.csv', encoding='ISO-8859-1')

# Checking the output of our imported dataset
print(train_df)
print(val_df)
print(test_df)

      sentiment                                          statement
0             1  The operations to be sold include manufacturin...
1             1  L&T has also made a commitment to redeem the r...
2             1              The deal was worth about EUR 1.2 mn .
3             1  FinancialWire tm is not a press release servic...
4             1  The share of the share capital of both above m...
...         ...                                                ...
3092          1  The Estonian beverages maker A. Le Coq today b...
3093          1  In volume , the focus is already outside Finla...
3094          1  Finland 's Technopolis is planning to bring th...
3095          1  The shares represented 4.998 % of total share ...
3096          2  The concept enables a commercially affordable ...

[3097 rows x 2 columns]
     sentiment                                          statement
0            1  It is estimated that the consolidated turnover...
1            1  The uranium found local

### Defining the SentimentDataset class
In this code block, we are setting up the necessary components for tokenizing and processing our text data in preparation for training our sentiment analysis model.

1. We first load the BERT tokenizer, which is a pre-trained tokenizer that corresponds to the BERT model architecture. This tokenizer will convert our text data into the format that BERT expects.

2. Next, we define a custom dataset class, `SentimentDataset`, which inherits from PyTorch's `Dataset` class. This dataset class will handle the loading and processing of our text data.

    - The `__init__` method takes in the dataframe containing our text data and the BERT tokenizer as arguments and stores them as instance variables.
    
    - The `__len__` method returns the length of the dataframe, which allows us to use the `len` function on instances of our dataset class.
    
    - The `__getitem__` method takes in an index and returns the tokenized text data and corresponding sentiment label at that index in the dataframe. The text data is tokenized using the BERT tokenizer, and the resulting input IDs and attention mask are returned along with the sentiment label.

In [None]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define model name
model_name = 'prajjwal1/bert-medium'

# Dataset class
class SentimentDataset(Dataset):
    def __init__(self, dataframe, tokenizer):
        self.dataframe = dataframe
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, index):
        # Get the text and sentiment from the dataframe at the specified index
        text = self.dataframe.iloc[index]['statement']
        sentiment = self.dataframe.iloc[index]['sentiment']

        # Tokenize the text
        inputs = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # Extract the input IDs and attention mask from the tokenized inputs
        input_ids = inputs['input_ids'].squeeze(0)
        attention_mask = inputs['attention_mask'].squeeze(0)

        # Return a dictionary containing the input IDs, attention mask, and sentiment
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'sentiment': torch.tensor(sentiment, dtype=torch.long)
        }
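
To make the roles of `padding='max_length'`, `truncation=True`, and the attention mask more concrete, here is a minimal plain-Python sketch of the idea. The function `pad_and_mask` and the example token IDs are illustrative assumptions, not part of the `transformers` API, and the sketch ignores details such as keeping special tokens in place when truncating.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad `token_ids` to `max_length` and build the matching mask."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to the fixed length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative token IDs for a short sentence.
ids, mask = pad_and_mask([101, 2023, 3066, 102], max_length=6)
print(ids)   # [101, 2023, 3066, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every item has the same length after padding, the `DataLoader` can stack items into a batch tensor, and the attention mask tells the model which positions to ignore. Since the cell above passes no `max_length`, the tokenizer should pad each statement to its configured maximum length, which is 512 tokens for `bert-base-uncased`.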

### Training and Evaluation Functions
We define two functions for training and evaluating our sentiment analysis model:

1. **Training Loop (`train_model`):**
    - This function takes in the training dataloader, model, optimizer and device as parameters.
    - The model is set to training mode using `model.train()`.
    - We initialize `total_loss` and `total_correct` to keep track of the cumulative loss and correct predictions during training.
    - We iterate over batches of data from the training dataloader.
        - For each batch, we zero the gradients, move the data to the specified device, and perform a forward pass through the model.
        - The loss and logits are extracted from the model's output.
        - A backward pass computes the gradients, and the optimizer then updates the model's parameters.
        - We update the `total_loss` and `total_correct` with the loss and accuracy of the current batch.
    - Finally, we calculate the average loss and accuracy over the entire training dataset and return them.

2. **Evaluation Loop (`evaluate_model`):**
    - This function has a similar structure to the training loop but is used for evaluating the model on validation or test data.
    - The model is set to evaluation mode using `model.eval()`.
    - We use a `torch.no_grad()` context manager to disable gradient calculation, as we don't need to update the model's parameters during evaluation.
    - The rest of the process is similar to the training loop, but we don't perform a backward pass or an optimizer step.
    - The average loss and accuracy over the evaluation dataset are returned.


In [None]:
# Training loop
def train_model(dataloader, model, optimizer, device):
    # Set the model to training mode
    model.train()
    total_loss = 0
    total_correct = 0
    for batch in dataloader:
        optimizer.zero_grad()

        # Move data to the specified device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['sentiment'].to(device)

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Update loss and accuracy
        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        total_correct += (preds == labels).sum().item()

    # Calculate average loss and accuracy
    avg_loss = total_loss / len(dataloader)
    accuracy = total_correct / len(dataloader.dataset)
    return avg_loss, accuracy

# Evaluation loop
def evaluate_model(dataloader, model, device):
    # Set the model to evaluation mode
    model.eval()
    total_loss = 0
    total_correct = 0
    with torch.no_grad():
        for batch in dataloader:
            # Move data to the specified device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['sentiment'].to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            # Update loss and accuracy
            total_loss += loss.item()
            preds = torch.argmax(logits, dim=1)
            total_correct += (preds == labels).sum().item()

    # Calculate average loss and accuracy
    avg_loss = total_loss / len(dataloader)
    accuracy = total_correct / len(dataloader.dataset)
    return avg_loss, accuracy
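
One detail of the loops above is worth spelling out: `avg_loss` divides by `len(dataloader)`, so it is the mean of the per-batch mean losses, whereas `accuracy` divides by `len(dataloader.dataset)` and is a true per-sample average. The two kinds of average differ slightly when the last batch is smaller than the others. The sketch below uses made-up batch losses and sizes to show the gap.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_losses = [0.50, 0.40, 0.90]
batch_sizes = [32, 32, 4]

# What the loops above compute: the mean of the per-batch means.
mean_of_batch_means = sum(batch_losses) / len(batch_losses)

# Per-sample mean: weight each batch loss by the number of samples in that batch.
total_loss = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
per_sample_mean = total_loss / sum(batch_sizes)

print(f"{mean_of_batch_means:.4f}")  # 0.6000
print(f"{per_sample_mean:.4f}")      # 0.4765
```

With full batches the two values coincide, so the difference matters only when the final batch is small and its loss is unusual.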

### Hyperparameter Tuning
We tune three hyperparameters, `batch_size`, `learning_rate`, and `num_epochs`, over the search space recommended in the original BERT paper. We use the Optuna library to search for the combination that minimizes the validation loss of our sentiment analysis model.

We then use the best hyperparameters found by the study to build our models.

In [None]:
def objective(trial):
    # Define hyperparameter search space as stated in the original BERT paper
    batch_size = trial.suggest_categorical('batch_size', [16, 32])
    learning_rate = trial.suggest_categorical('learning_rate', [5e-5, 3e-5, 2e-5])
    num_epochs = trial.suggest_categorical('num_epochs', [2, 3, 4])

    # Create dataset and dataloader for both train and val df
    train_dataset = SentimentDataset(train_df, tokenizer)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    val_dataset = SentimentDataset(val_df, tokenizer)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # Load model
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3).to(device)

    # Define optimizer
    optimizer = AdamW(model.parameters(), lr=learning_rate)

    # Train and evaluate the model
    for epoch in range(num_epochs):
        train_loss, train_accuracy = train_model(train_dataloader, model, optimizer, device)
        val_loss, val_accuracy = evaluate_model(val_dataloader, model, device)

    return val_loss  # Optimize for minimum validation loss

# Optimize the objective function
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10, gc_after_trial=True) # We cap the number of trials to 10

# Print the best hyperparameters
print('Best trial:')
trial = study.best_trial
print(f'Value: {trial.value}')
print('Params: ')
for key, value in trial.params.items():
    print(f'    {key}: {value}')

[I 2023-10-30 15:11:21,667] A new study created in memory with name: no-name-50619def-215f-4ba3-86f3-abcd6adda076


Downloading (…)lve/main/config.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/167M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2023-10-30 15:15:12,039] Trial 0 finished with value: 0.4178488850593567 and parameters: {'batch_size': 32, 'learning_rate': 3e-05, 'num_epochs': 2}. Best is trial 0 with value: 0.4178488850593567.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2023-10-30 15:22:19,602] Trial 1 finished with value: 0.6537858664989471 and parameters: {'batch_size': 32, 'learning_rate': 5e-05, 'num_epochs': 4}. Best is trial 0 with value: 0.4178488850593567.
Some

Best trial:
Value: 0.4178488850593567
Params: 
    batch_size: 32
    learning_rate: 3e-05
    num_epochs: 2
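
To put the cap of 10 trials in context, the sketch below enumerates the full search space with `itertools.product`. The dictionary `search_space` simply restates the values used in `objective`; the variable names are ours.

```python
from itertools import product

# Search space copied from the objective function above.
search_space = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [2, 3, 4],
}

# Every possible combination of the three categorical choices.
all_configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

n_trials = 10  # the cap passed to study.optimize
print(len(all_configs))  # 18
print(f"{n_trials} trials can cover at most {n_trials / len(all_configs):.0%} of the grid")
```

Since the grid has 18 combinations and Optuna's sampler may revisit a combination, the reported best trial is the best among the configurations actually sampled, not necessarily the best of the whole grid.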


### Defining Hyperparameters
Using the optimal hyperparameter values found in the search above, we define the `batch_size`, `learning_rate`, and `num_epochs` based on the values used in the best trial.

In [None]:
batch_size = trial.params['batch_size']
learning_rate = trial.params['learning_rate']
num_epochs = trial.params['num_epochs']

### Training the Model
We train the model using the hyperparameters found above.

In [None]:
def get_trained_model(train_df, val_df):
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3).to(device)

    # Define optimizer
    optimizer = AdamW(model.parameters(), lr=learning_rate)

    # Create dataset and data loader
    train_dataset = SentimentDataset(train_df, tokenizer)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    val_dataset = SentimentDataset(val_df, tokenizer)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # Train and evaluate the model
    for epoch in range(num_epochs):
        train_loss, train_accuracy = train_model(train_dataloader, model, optimizer, device)
        val_loss, val_accuracy = evaluate_model(val_dataloader, model, device)
        print(f'Epoch {epoch + 1}/{num_epochs} | Train Loss: {train_loss:.4f} | Train Accuracy: {train_accuracy:.4f} | Val Loss: {val_loss:.4f} | Val Accuracy: {val_accuracy:.4f}')

    return model

model = get_trained_model(train_df, val_df)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2 | Train Loss: 0.6702 | Train Accuracy: 0.7155 | Val Loss: 0.4787 | Val Accuracy: 0.7974
Epoch 2/2 | Train Loss: 0.3671 | Train Accuracy: 0.8621 | Val Loss: 0.4645 | Val Accuracy: 0.8142


### Testing the Model
We evaluate the model's performance on the unseen test set.

In [None]:
# Create test dataset and dataloader
test_dataset = SentimentDataset(test_df, tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Evaluate the model on the test set
test_loss, test_accuracy = evaluate_model(test_dataloader, model, device)

# Print test performance
print(f'Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f}')


Test Loss: 0.3957 | Test Accuracy: 0.8306


Since the data preprocessing notebook showed that the dataset is imbalanced across the three sentiment classes, we also evaluate the model with the F1 score, which balances precision and recall. We use the weighted average, so each class's F1 score contributes in proportion to its number of test examples.

In [None]:
# Evaluation loop that also returns the weighted F1 score
def evaluate_model_f1(dataloader, model, device):
    model.eval()
    total_loss = 0
    total_correct = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in dataloader:
            # Move data to the specified device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['sentiment'].to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            # Update loss, accuracy, labels, and predictions
            total_loss += loss.item()
            preds = torch.argmax(logits, dim=1)
            total_correct += (preds == labels).sum().item()
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    # Calculate average loss and accuracy
    avg_loss = total_loss / len(dataloader)
    accuracy = total_correct / len(dataloader.dataset)

    # Calculate F1 score
    f1 = f1_score(all_labels, all_preds, average='weighted')

    return avg_loss, accuracy, f1

# Evaluate the model on the test set
test_loss, test_accuracy, test_f1 = evaluate_model_f1(test_dataloader, model, device)

# Print test performance
print(f'Test Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f} | Test F1 Score: {test_f1:.4f}')

Test Loss: 0.3957 | Test Accuracy: 0.8306 | Test F1 Score: 0.8305
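
To show what `average='weighted'` does, here is a small plain-Python sketch of the weighted F1 score: the per-class F1 scores are averaged with weights equal to each class's share of the true labels. The function `weighted_f1` and the toy labels are ours, written for illustration rather than taken from scikit-learn.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / len(y_true)  # weight by the class's share of true labels
    return score


# Hypothetical labels: class 1 dominates and the single class-0 example is missed.
y_true = [1, 1, 1, 1, 0, 2]
y_pred = [1, 1, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

In this toy case the accuracy is 5/6, while the weighted F1 is lower because the missed minority class contributes an F1 of zero. The weighting still favours the majority class, so a macro average (`average='macro'`), which treats all classes equally, would be a stricter check on minority-class performance.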


### Testing Strategies to Handle a Small Imbalanced Dataset
In the data preprocessing notebook, we generated additional training examples for the minority class `0` (`negative` sentiment). We now retrain the model on each augmented training set to evaluate the effect of data augmentation on model performance.

### Google Translate Augmentation
We first import the training dataset augmented with Google Translate.

In [None]:
google_translate_train_df = pd.read_csv('google-translate-train.csv', encoding='ISO-8859-1')

google_translate_train_df

Unnamed: 0,sentiment,statement
0,0,Nokia Siemens Networks has struggled to turn a...
1,0,Dolce & Gabbana has asked the European Union t...
2,0,Performance in 2006 was influenced by the cons...
3,0,"National conciliator Juhani Salonius, who met ..."
4,0,The loss after financial items amounted to 9.7...
...,...,...
3479,1,The Estonian beverages maker A. Le Coq today b...
3480,1,"In volume , the focus is already outside Finla..."
3481,1,Finland 's Technopolis is planning to bring th...
3482,1,The shares represented 4.998 % of total share ...


We train the model on the training dataset augmented using Google Translate.

In [None]:
google_translate_model = get_trained_model(google_translate_train_df, val_df)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2 | Train Loss: 0.7070 | Train Accuracy: 0.6972 | Val Loss: 0.4841 | Val Accuracy: 0.8026
Epoch 2/2 | Train Loss: 0.3523 | Train Accuracy: 0.8622 | Val Loss: 0.4417 | Val Accuracy: 0.8116


We obtain the performance of the model in terms of the F1 score on the test set.

In [None]:
# Evaluate the model on the test set
google_translate_test_loss, google_translate_test_accuracy, google_translate_test_f1 = evaluate_model_f1(test_dataloader, google_translate_model, device)

# Print test performance
print(f'Test Loss: {google_translate_test_loss:.4f} | Test Accuracy: {google_translate_test_accuracy:.4f} | Test F1 Score: {google_translate_test_f1:.4f}')

Test Loss: 0.4147 | Test Accuracy: 0.8233 | Test F1 Score: 0.8235


### `nlpaug` Augmentation
We import the training dataset augmented with the `nlpaug` library.

In [None]:
nlpaug_train_df = pd.read_csv('nlpaug-train.csv', encoding='ISO-8859-1')

nlpaug_train_df

Unnamed: 0,sentiment,statement
0,1,The operations to be sold include manufacturin...
1,1,L&T has also made a commitment to redeem the r...
2,1,The deal was worth about EUR 1.2 mn .
3,1,FinancialWire tm is not a press release servic...
4,1,The share of the share capital of both above m...
...,...,...
3488,0,"Outokumpu ' s steel john mill in Tornio, in Fi..."
3489,0,The net sales lessen to EUR 49. 8 million from...
3490,0,Finnish electronics contract manufacturer Scan...
3491,0,Finnish IT solutions provider Affecto Oyj hela...


We train the model on the training dataset augmented using `nlpaug`.

In [None]:
nlpaug_model = get_trained_model(nlpaug_train_df, val_df)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2 | Train Loss: 0.6772 | Train Accuracy: 0.7091 | Val Loss: 0.4757 | Val Accuracy: 0.8039
Epoch 2/2 | Train Loss: 0.3397 | Train Accuracy: 0.8717 | Val Loss: 0.4089 | Val Accuracy: 0.8258


We obtain the performance of the model in terms of the F1 score on the test set.

In [None]:
# Evaluate the model on the test set
nlpaug_test_loss, nlpaug_test_accuracy, nlpaug_test_f1 = evaluate_model_f1(test_dataloader, nlpaug_model, device)

# Print test performance
print(f'Test Loss: {nlpaug_test_loss:.4f} | Test Accuracy: {nlpaug_test_accuracy:.4f} | Test F1 Score: {nlpaug_test_f1:.4f}')

Test Loss: 0.3626 | Test Accuracy: 0.8440 | Test F1 Score: 0.8435


### GPT-2 Augmentation
We import the training dataset augmented with GPT-2.

In [None]:
gpt_augment_train_df = pd.read_csv('gpt-augment-train.csv', encoding='ISO-8859-1')

gpt_augment_train_df

Unnamed: 0,sentiment,statement,gpt_output
0,1,The operations to be sold include manufacturin...,The operations to be sold include manufacturin...
1,1,L&T has also made a commitment to redeem the r...,L&T has also made a commitment to redeem the r...
2,1,The deal was worth about EUR 1.2 mn .,The deal was worth about EUR 1.2 mn. Given thi...
3,1,FinancialWire tm is not a press release servic...,FinancialWire tm is not a press release servic...
4,1,The share of the share capital of both above m...,The share of the share capital of both above m...
...,...,...,...
3092,1,The Estonian beverages maker A. Le Coq today b...,The Estonian beverages maker A. Le Coq today b...
3093,1,"In volume , the focus is already outside Finla...","In volume, the focus is already outside Finlan..."
3094,1,Finland 's Technopolis is planning to bring th...,Finland's Technopolis is planning to bring the...
3095,1,The shares represented 4.998 % of total share ...,The shares represented 4.998 % of total share ...


We replace the old `statement` column with the `gpt_output` column.

In [None]:
# Drop the 'statement' column
gpt_augment_train_df = gpt_augment_train_df.drop('statement', axis=1)

# Rename the 'gpt_output' column to 'statement'
gpt_augment_train_df = gpt_augment_train_df.rename(columns={'gpt_output': 'statement'})

gpt_augment_train_df

Unnamed: 0,sentiment,statement
0,1,The operations to be sold include manufacturin...
1,1,L&T has also made a commitment to redeem the r...
2,1,The deal was worth about EUR 1.2 mn. Given thi...
3,1,FinancialWire tm is not a press release servic...
4,1,The share of the share capital of both above m...
...,...,...
3092,1,The Estonian beverages maker A. Le Coq today b...
3093,1,"In volume, the focus is already outside Finlan..."
3094,1,Finland's Technopolis is planning to bring the...
3095,1,The shares represented 4.998 % of total share ...


We train the model on the training dataset augmented using GPT-2.

In [None]:
gpt_augment_model = get_trained_model(gpt_augment_train_df, val_df)

Downloading (…)lve/main/config.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/167M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2 | Train Loss: 0.6943 | Train Accuracy: 0.7052 | Val Loss: 0.5204 | Val Accuracy: 0.7768
Epoch 2/2 | Train Loss: 0.3667 | Train Accuracy: 0.8595 | Val Loss: 0.4370 | Val Accuracy: 0.8155


We obtain the performance of the model in terms of the F1 score on the test set.

In [None]:
# Evaluate the model on the test set
gpt_augment_test_loss, gpt_augment_test_accuracy, gpt_augment_test_f1 = evaluate_model_f1(test_dataloader, gpt_augment_model, device)

# Print test performance
print(f'Test Loss: {gpt_augment_test_loss:.4f} | Test Accuracy: {gpt_augment_test_accuracy:.4f} | Test F1 Score: {gpt_augment_test_f1:.4f}')

Test Loss: 0.3890 | Test Accuracy: 0.8409 | Test F1 Score: 0.8416
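
To compare the four training sets side by side, the sketch below collects the test metrics printed in the cells above and ranks them by F1 score. The dictionary `results` is hand-copied from those outputs; the variable names are ours.

```python
# Test-set metrics copied from the cell outputs in this notebook.
results = {
    "Original": {"loss": 0.3957, "accuracy": 0.8306, "f1": 0.8305},
    "Google Translate": {"loss": 0.4147, "accuracy": 0.8233, "f1": 0.8235},
    "nlpaug": {"loss": 0.3626, "accuracy": 0.8440, "f1": 0.8435},
    "GPT-2": {"loss": 0.3890, "accuracy": 0.8409, "f1": 0.8416},
}

baseline_f1 = results["Original"]["f1"]

# Rank the training sets by test F1, best first, and show the change versus the baseline.
ranked = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)
for name, metrics in ranked:
    delta = metrics["f1"] - baseline_f1
    print(f"{name:<17} F1={metrics['f1']:.4f}  ({delta:+.4f} vs original)")
```

In these runs, the `nlpaug` and GPT-2 training sets score about one point above the original training set, and the Google Translate set scores slightly below it. Each configuration was trained only once, so differences of this size may partly reflect run-to-run variation.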
