# Sentiment Analysis Using Transformers - Bert

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import random

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Set the seed values to achieve consistent results on re-run

seed_val = 10
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

## Reading and Preprocessing the Data

In [None]:
# Read Data
train_data = pd.read_csv("/content/drive/MyDrive/ML_Project/train.csv")
test_x = pd.read_csv("/content/drive/MyDrive/ML_Project/test.csv")
test_y = pd.read_csv("/content/drive/MyDrive/ML_Project/test_labels.csv")

Some comments in train and test dataset have -1 as their label for the sentiments. As mentioned on the dataset's kaggle page, the labels for these comments do not exist. Therefore, we have removed these comments from the datsets.

In [None]:
target_labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

In [None]:
# Drop the rows whose labels are not available
train_data = train_data[(train_data['toxic'] != -1) & (train_data['severe_toxic'] != -1) & (train_data['obscene'] != -1) & (train_data['threat'] != -1) & (train_data['insult'] != -1) & (train_data['identity_hate'] != -1)]

# Combine test_x and test_y
test_data = pd.concat([test_x, test_y], axis=1).sample(frac=1, random_state = 10)
test_data = test_data[(test_data['toxic'] != -1) & (test_data['severe_toxic'] != -1) & (test_data['obscene'] != -1) & (test_data['threat'] != -1) & (test_data['insult'] != -1) & (test_data['identity_hate'] != -1)]

print(train_data.shape)
print(test_data.shape)

(63978, 9)

In [8]:
# Seperate test_x and test_y from test_data

test_x = np.array(test_data["comment_text"])
test_y = np.array(test_data[target_labels])

print(f"test_x {test_x.shape}  ;  test_y {test_y.shape}")

test_x (63978,)  ;  test_y (63978, 6)  ;  test_labels (63978,)


In [9]:
# Seperate train_x and train_y from train_data

train_data_x = np.array(train_data["comment_text"])
train_data_y = np.array(train_data[target_labels])

print(f"train_x {train_data_x.shape}  ;  train_y {train_data_y.shape}")

train_x (159571,)  ;  train_y (159571, 6)


We lowercase and remove any extra spaces from the comments. The rest of the heavy lifting to make the inputs suitable is left to the Bert Tokenizer (used later on).

In [10]:
# Clean Data

def lowercase(txt):
  return txt.lower()

def remove_punctuation(txt):
  return re.sub(r"[^\w\s\d]", "", txt)

def remove_numbers(txt):
  return re.sub(r"\d", "", txt)

def remove_extra_spaces(txt):
  return " ".join(txt.split())

def normalize_sentence(txt):
  txt = str(txt)
  txt = lowercase(txt)
  # txt = remove_punctuation(txt)
  # txt = remove_numbers(txt)
  txt = remove_extra_spaces(txt)
  return txt

for idx, comment in enumerate(train_data_x):
  train_data_x[idx] = normalize_sentence(comment)

for idx, comment in enumerate(test_x):
  test_x[idx] = normalize_sentence(comment)

In [11]:
# Split dataset into train, validation sets (70-30)

train_x, val_x, train_y, val_y = train_test_split(train_data_x, train_data_y, test_size=0.3, random_state=10)

print(f"train_x {train_x.shape}  ;  train_y {train_y.shape}")
print(f"val_x {val_x.shape}  ;  val_y {val_y.shape}")

train_x (111699,)  ;  train_y (111699, 6)
val_x (47872,)  ;  val_y (47872, 6)


## Using BERT

BERT (Bidirectional Encoder Representations from Transformers) is a decoder-only transformer-based model designed for natural language understanding. Since we need a model that understand the text/comment we provide it and assigns it a toxicity label, BERT would be a good choice as it can understand the context of words (in both both left and right direction) in a sentence by pre-training a deep bidirectional representation of text.

#### Tokenizing

The input are the comments (text based and variable sized), hence we use the BertTokenizer from the transformers modules provided by HuggingFace to convert the data into something our model can learn from.

Specifically, we use the BertTokenizer from the bert-base-uncased model, which means it uses a pre-trained BERT model with lowercase tokens. We encode all the comments by converting them into token IDs. Token IDs are unique numerical representations assigned to each token in a text sequence, enabling a model like BERT to understand and process the input by mapping tokens to corresponding embeddings and capturing positional information. They encode the text into a format the model comprehends, ensuring accurate interpretation and learning from the sequence during training. 

We enable the use of special tokens like [CLS] (start token - appended at the start of each sequence) and [SEP] (separator token - signifies the end of a segment within a sentence or the end of one sentence the beginning of another). We pad (represented by [PAD] token) or truncate the sequences to the maximum length (512 by default) to maintain a consistent input size to enable batch processing (used later on). The tokenizer randomly masks certain token in the input sequence (represented by [MASK] token). This is helpful while training the model, as the model can be trained to predict the original tokens from the masked positions, encouraging it to learn robust contextual representations. We also return the attention mask that we will provide our model during training so that the model differntiate between actual tokens and padding tokens, enabling the model to know which token to pay attention to and which tokens to ignore.

In [12]:
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def tokenize_data(tokenizer, data_x, data_y):
  input_ids = []
  token_type_ids = []
  attention_masks = []

  # Loop through the comments as the tokenizer takes in a single comment at a time.
  for comment in data_x:
    # Encode the comment. 
    encoded_dict = tokenizer.encode_plus(
    comment,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',)

    # Store the encoded information
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
    token_type_ids.append(encoded_dict['token_type_ids'])

  # Convert the information lists to tensors
  input_ids = torch.cat(input_ids, dim=0)
  attention_masks = torch.cat(attention_masks, dim=0)
  token_type_ids = torch.cat(token_type_ids, dim=0)
  labels = torch.tensor(data_y, dtype=torch.float32)

  return input_ids, attention_masks, token_type_ids, labels

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [13]:
# Tokenize all the data

train_input_ids, train_attention_masks, train_token_type_ids, train_labels = tokenize_data(tokenizer, train_x, train_y)
val_input_ids, val_attention_masks, val_token_type_ids, val_labels = tokenize_data(tokenizer, val_x, val_y)
test_input_ids, test_attention_masks, test_token_type_ids, test_labels = tokenize_data(tokenizer, test_x, test_y)

In [14]:
# Combine the data (token_ids, attention_masks, true_labels) in the individual tensors and create a TensorDataset for train, validation, and test data. We use DataLoader to produce iterable batches of our dataset for more efficient processing.

batch_size = 8

# train
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

# val
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# test
test_dataset = TensorDataset(test_input_ids, test_attention_masks, test_labels)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [16]:
# Save the TensorDatasets

torch.save(train_dataset, "/content/drive/MyDrive/ML_Project/train_dataset.pt")
torch.save(val_dataset, "/content/drive/MyDrive/ML_Project/val_dataset.pt")
torch.save(test_dataset, "/content/drive/MyDrive/ML_Project/test_dataset.pt")

## Setting up the Model

We use the pretrained bert model from the transformers module provided by HuggingFace. Specifically, we use 'bert-base-uncased' model, which is a pretrained bert model on lower case inputs. We define the number of labels as 6, since a comment can belong to multiple of the following categories: toxic, severe_toxic, obscene, threat, insult, and identity_hate. 

In [17]:
# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the model architecture.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)
model.to(device)

# Set up the Adam optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.000025)

# Set the number of epochs
num_epochs = 1 # Takes a lot time to train. Unortunately Colab GPU timesout:(

# We use a scheduler to adjust the learning rate while training the model. We start from the assigned learning rate (0.000025) and the scheduler follows a linear decay schedule to decrease the learning rate with each iteration. 
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(train_dataset)*num_epochs)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Training and Evaluating

BERT utilizes two major training methods during its pre-training phase: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

1. Masked Language Model (MLM):
As mentioned above, BERT masks some of the words in the input sentence with a [MASK] token. The model tries to predict some randomly masked tokens by using the sorrounding context (both on left and right side) as clues. This training method allows BERT to learn the (English) language and proper word usage in a sentence.

2. Next Sentence Prediction (NSP):
NSP is used to help BERT learn about relationships between sentences by predicting if a given sentence follows the previous sentence or not. BERT is trained on a task where it predicts whether a pair of sentences are in the original order or if they are randomly swapped. The model uses [CLS] and [SEP] to pick out random sentences/segments for this method of training. This training method allows BERT to comprehend how sentences relate to each other in a larger context.

We are using a pre-trained BERT model from transformers module that has been trained on a large corpus of text using the MLM and NSP objectives. We then use our dataset of comments and their toxicity labels to fine-tune the model and make it more better/specialized at that task.

In [18]:
def evaluate(model, data_loader, device):
  model.eval()  # Set the model to evaluation mode
  true_labels = []
  predicted_probabilities = []
  total_loss = 0.0

  # Disable gradient computation during validation
  with torch.no_grad():
    for batch in data_loader: # Iterate through each batch
      
      # Extract the necessary information 
      input_ids = batch[0].to(device)
      attention_masks = batch[1].to(device)
      labels = batch[2].to(device)

      # Provide the information to the model and generate output
      outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
      
      # By applying sigmoid function on the output logits, we get the probabilities for each target type/label (toxic, severe_toxic, etc). We store the probabilities of the label for the batch
      predicted_probabilities_batch = torch.sigmoid(outputs.logits).cpu().numpy()
      predicted_probabilities.append(predicted_probabilities_batch)

      # We store the true labels of the inputs in the batch
      true_labels_batch = labels.int().cpu().numpy()
      true_labels.append(true_labels_batch)

      # Add the loss of the current batch to the total_loss
      loss = outputs.loss
      total_loss += loss.item()

  # Convert the lists to numpy arrays for efficient processing
  true_labels = np.concatenate(true_labels, axis=0)
  predicted_probabilities = np.concatenate(predicted_probabilities, axis=0)

  # Apply threshold of 0.5 to get the 0/1 value for each label
  # For example: if the predicted probabilities for an input/comment is [0.6 0.4 0.3 0.7 0.1], it will be converted to [1 0 0 1 0]
  predicted_labels = (predicted_probabilities > 0.5).astype(int)

  # Calculate the evaluation metrics and return them
  accuracy = accuracy_score(true_labels, predicted_labels)
  avg_loss = total_loss/len(data_loader)
  report = classification_report(true_labels, predicted_labels, target_names=target_labels)
  
  return avg_loss, accuracy, report

In [19]:
def fit(model, train_loader, val_loader, optimizer, scheduler, num_epochs, device):
  training_losses = []
  validation_losses = []
  validation_accuracies = []
  validation_reports = []

  # Training loop
  for epoch in range(num_epochs):
    model.train() # Set the model to train mode
    total_loss = 0

    for idx, batch in enumerate(train_loader): # Iterate through each batch
      
      # Extract the necessary information 
      input_ids = batch[0].to(device)
      attention_masks = batch[1].to(device)
      labels = batch[2].to(device)

      # Do a forward pass, calculate the loss
      optimizer.zero_grad()
      outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
      loss = outputs.loss
      logits = outputs.logits
      total_loss += loss.item()
      
      # Backpropagate the loss and step the optimizer to adjust the paramters of the model. Step the scheduler to adjust the learning rate.
      loss.backward()
      optimizer.step()
      scheduler.step()

      if (idx % 100 == 0):
        print(idx)

    # Evaluate on the validation set
    val_loss, val_acc, val_report = evaluate(model, val_loader, device)
    avg_train_loss = total_loss/len(train_loader)
    avg_val_loss = val_loss/len(val_loader)

    # Store the evaluation metrics calculated
    training_losses.append(avg_train_loss)
    validation_losses.append(avg_val_loss)
    validation_accuracies.append(val_acc)
    validation_reports.append(val_report)

    # Print the average loss for the current epoch and the classification report of the evaluation on the validation set
    print(f'Epoch {epoch+1}, Training Loss: {avg_train_loss}, Validation Loss: {avg_val_loss}, Validation Accuracy: {val_acc}')
    print('Validation Classification Report\n', val_report)

    # Save the trained model
    torch.save(model.state_dict(), f"/content/drive/MyDrive/ML_Project/trained_model_epoch{epoch}.pth")

  return training_losses, validation_losses, validation_accuracies, validation_reports

In [20]:
# Train the model

training_losses, validation_losses, validation_accuracies, validation_reports = fit(model, train_loader, val_loader, optimizer, scheduler, num_epochs, device)

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900
5000
5100
5200
5300
5400
5500
5600
5700
5800
5900
6000
6100
6200
6300
6400
6500
6600
6700
6800
6900
7000
7100
7200
7300
7400
7500
7600
7700
7800
7900
8000
8100
8200
8300
8400
8500
8600
8700
8800
8900
9000
9100
9200
9300
9400
9500
9600
9700
9800
9900
10000
10100
10200
10300
10400
10500
10600
10700
10800
10900
11000
11100
11200
11300
11400
11500
11600
11700
11800
11900
12000
12100
12200
12300
12400
12500
12600
12700
12800
12900
13000
13100
13200
13300
13400
13500
13600
13700
13800
13900


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Epoch 1, Training Loss: 0.04801060528770128, Validation Loss: 6.794906342469118e-06, Validation Accuracy: 0.9258230280748663
Validation Classification Report
                precision    recall  f1-score   support

        toxic       0.82      0.84      0.83      4492
 severe_toxic       0.00      0.00      0.00       465
      obscene       0.90      0.73      0.81      2498
       threat       0.60      0.33      0.42       168
       insult       0.81      0.66      0.73      2325
identity_hate       0.80      0.25      0.39       445

    micro avg       0.83      0.70      0.76     10393
    macro avg       0.65      0.47      0.53     10393
 weighted avg       0.80      0.70      0.74     10393
  samples avg       0.07      0.07      0.07     10393



In [21]:
# Test the model

test_loss, test_acc, test_report = evaluate(model, test_loader, device)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [23]:
# See how the model performs on the test data

print("Test Loss: ", test_loss)
print("Test Accuracy: ", test_acc)
print("Test Classification Report")
print(test_report)

Test Loss:  0.06464410638276855
Test Accuracy:  0.8745037356591329
Test Classification Report
               precision    recall  f1-score   support

        toxic       0.55      0.88      0.68      6090
 severe_toxic       0.00      0.00      0.00       367
      obscene       0.72      0.68      0.70      3691
       threat       0.65      0.34      0.45       211
       insult       0.73      0.58      0.65      3427
identity_hate       0.85      0.29      0.44       712

    micro avg       0.62      0.70      0.66     14498
    macro avg       0.58      0.46      0.48     14498
 weighted avg       0.64      0.70      0.64     14498
  samples avg       0.08      0.07      0.07     14498



#### Evaluation

The model has a loss of 0.048 on train dataset and a loss of 6.78*10^-6 on validation dataset. It achieves an accuracy of 92.5% on validation dataset and 87.5% accuracy on the test dataset. This means it classifies a comment correctly correctly decides majority of the times. However, the value of accuracy can sometimes be misleading. When we have many different labels to predict, the overall accuracy might not show how the model does on each specific label, especially if some labels are much more common than others.

Toxic, Obscene, and Insult classes show relatively higher precision, recall, and F1-scores, indicating better model performance in predicting these categories. Severe_toxic and Threat classes have lower precision, recall, and F1-scores, suggesting challenges in accurately predicting these categories. For example, severe_toxic has zeros across precision, recall, and F1-score in this evaluation, indicating no correct predictions for this class. Identity_hate has high precision value but low recall and F1-score, indicating that the model has difficulty predicting a identity_hate comment, however when it does classifies a comment as identity_hate, it does so correctly most often. 