# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

## Introduction

### What is BERT

BERT (Bidirectional Encoder Representations from Transformers) is a large-scale, transformer-based language model that can be fine-tuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_%28Sesame_Street%29) ;)

<img src="Images/BERT_diagrams.pdf" width="1000">

## Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [135]:
import torch                      #Used for deep learning, tensor computations, and GPU acceleration.
import pandas as pd               #Essential for data manipulation and analysis
from tqdm.notebook import tqdm    #A progress bar utility for Jupyter notebooks.

In [136]:
df = pd.read_csv('Data/smile-annotations-final.csv', names=['id', 'text', 'category'])
df.set_index('id', inplace=True)

In [137]:
df.head()

| id | text | category |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


In [138]:
df.category.value_counts()

category
nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: count, dtype: int64

In [139]:
df = df[~df.category.str.contains('\\|')]

In [140]:
df = df[df.category != 'nocode']

In [141]:
df.category.value_counts()

category
happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: count, dtype: int64
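As a quick sanity check on the two filters above, the sketch below applies the same rule (drop every category containing `|`, then drop `nocode`) to the counts printed earlier, using plain Python rather than pandas. The dictionary only restates the first `value_counts()` output; the variable names are illustrative.

```python
# Counts copied from the first value_counts() output above.
counts = {
    "nocode": 1572, "happy": 1137, "not-relevant": 214, "angry": 57,
    "surprise": 35, "sad": 32, "happy|surprise": 11, "happy|sad": 9,
    "disgust|angry": 7, "disgust": 6, "sad|disgust": 2, "sad|angry": 2,
    "sad|disgust|angry": 1,
}

# Keep only single-label categories that are not "nocode".
kept = {name: n for name, n in counts.items() if "|" not in name and name != "nocode"}

total_before = sum(counts.values())
total_after = sum(kept.values())

print(sorted(kept))               # the six remaining emotion classes
print(total_before, total_after)  # 3085 1481
```

The remaining 1481 tweets are dominated by `happy` (1137), so the dataset is strongly imbalanced.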

In [142]:
possible_labels = df.category.unique()

In [143]:
possible_labels

array(['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise'],
      dtype=object)

In [144]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [145]:
# Map each category name to its integer label.
df['label'] = df.category.map(label_dict)


In [146]:
df.head()

| id | text | category | label |
|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 |
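The label mapping built above can also be written as a dictionary comprehension. The sketch below rebuilds it from the `possible_labels` order shown earlier and derives the inverse mapping (index to name), which is handy when printing per-class results later. It uses a plain list in place of the NumPy array so that it runs on its own.

```python
# Same order as the possible_labels array printed above.
possible_labels = ['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise']

# Category name -> integer label.
label_dict = {label: index for index, label in enumerate(possible_labels)}

# Integer label -> category name.
label_dict_inverse = {index: label for label, index in label_dict.items()}

print(label_dict)
print(label_dict_inverse[2])  # angry
```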


## Training/Validation Split

In [148]:
from sklearn.model_selection import train_test_split

In [149]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=17, 
                                                  stratify=df.label.values)

In [150]:
df['data_type'] = ['not_set']*df.shape[0]

In [151]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [152]:
df.groupby(['category', 'label', 'data_type']).count()

| category | label | data_type | text (count) |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |
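Passing `stratify=df.label.values` asks `train_test_split` to keep the class proportions roughly equal in the training and validation parts, which matters here because some classes have only a handful of examples. The sketch below illustrates the idea in plain Python. `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and its per-class rounding rule is an assumption made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction, seed):
    """Return (train_idx, val_idx) so that each class is split separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # assumed rounding rule
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Class sizes taken from the filtered value counts earlier in this notebook.
class_sizes = {"happy": 1137, "not-relevant": 214, "angry": 57,
               "surprise": 35, "sad": 32, "disgust": 6}
labels = [name for name, n in class_sizes.items() for _ in range(n)]

train_idx, val_idx = stratified_split(labels, val_fraction=0.15, seed=17)
print(len(train_idx), len(val_idx))  # 1258 223
```

Without stratification, a random 15% sample could easily contain no `disgust` example at all, since that class has only six tweets.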


## Loading Tokenizer and Encoding our Data

In [154]:
from transformers import BertTokenizer                                                                  
from torch.utils.data import TensorDataset

# BertTokenizer (from the Hugging Face transformers library)
# Tokenizer for BERT (Bidirectional Encoder Representations from Transformers).
# It converts raw text into model-ready input by:
#   - splitting text into subword tokens,
#   - adding the special tokens [CLS] and [SEP],
#   - converting tokens into numerical IDs (input IDs),
#   - creating attention masks that mark which positions are real tokens and which are padding.

# TensorDataset (from torch.utils.data)
# Wraps several tensors into a single dataset, so that model inputs
# (such as tokenized text) and labels are stored and indexed together.
# It is typically combined with PyTorch's DataLoader to process data in batches.

In [155]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)
# Loads a pre-trained tokenizer from Hugging Face’s model hub. Uncased version (converts all text to lowercase)
# do_lower_case: Ensures all text is converted to lowercase before tokenization.

In [156]:
# Tokenize the training and validation texts with BERT's tokenizer and return PyTorch tensors.
#
# batch_encode_plus() processes multiple texts at once, performing:
#   - Tokenization: splitting text into subwords
#   - Special tokens: adding [CLS] and [SEP]
#   - Padding: padding every sequence to max_length (256)
#   - Truncation: cutting longer texts down to max_length
#   - Tensor conversion: return_tensors='pt' returns PyTorch tensors
# It returns the token IDs (input_ids) and the attention masks.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',    # replaces the deprecated pad_to_max_length=True
    truncation=True,         # truncate explicitly to max_length
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)

# Extract the tokenized outputs for training
input_ids_train = encoded_data_train['input_ids']                       # Token IDs of each sentence
attention_masks_train = encoded_data_train['attention_mask']           # Mask (1 for real tokens, 0 for padding)
labels_train = torch.tensor(df[df.data_type=='train'].label.values)    # Labels as a PyTorch tensor

# Extract input IDs, attention masks, and labels for validation
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)
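To make padding, truncation, and the attention mask concrete, here is a small plain-Python sketch of what happens to one sequence of token IDs. `pad_or_truncate` is a hypothetical helper written for illustration, not the tokenizer's own code, and the word-piece IDs in the two calls are made up. The special-token IDs are the ones used by the `bert-base-uncased` vocabulary.

```python
# Special-token IDs in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids, max_length):
    """Wrap IDs in [CLS] ... [SEP], then truncate or pad to exactly max_length."""
    body = token_ids[: max_length - 2]           # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding


# Illustrative word-piece IDs; real values come from the tokenizer.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=6)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=6)

print(short_ids, short_mask)  # [101, 7592, 2088, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 11, 12, 13, 14, 102] [1, 1, 1, 1, 1, 1]
```

Every row ends up with the same length, which is what allows the encoded texts to be stacked into a single tensor.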


In [157]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

# TensorDataset is a simple dataset wrapper from torch.utils.data that stores
# several tensors together and supports standard dataset indexing.
# Each item is an (input_ids, attention_mask, label) triple, so a DataLoader
# can later draw mini-batches of all three components at once.

In [158]:
len(dataset_train)

1258

In [159]:
len(dataset_val)

223

## Setting up BERT Pretrained Model

In [161]:
from transformers import BertForSequenceClassification

# BertForSequenceClassification is the Hugging Face BERT model with a classification head.
# BERT processes the input text and produces contextualized embeddings.
# The head is a single linear layer applied (after dropout) to BERT's pooled output,
# which is derived from the final hidden state of the [CLS] token.
# That layer outputs one score (logit) per class; here the classes are the six labels in label_dict.

In [162]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",  # Name of the pre-trained model
                                                      num_labels=len(label_dict), # Number of output labels for classification
                                                      output_attentions=False, # Do not output attention weights
                                                      output_hidden_states=False) # Do not output hidden states

# When `labels` are passed to the model, BertForSequenceClassification computes the loss itself.
# For single-label classification with integer class labels, as here, that loss is cross-entropy,
# so no separate loss function needs to be defined.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
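For intuition about the loss the model returns, the sketch below computes cross-entropy from raw logits in plain Python. It is a simplified illustration rather than the library's code; the function names and the example logits are made up.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def batch_loss(batch_logits, batch_labels):
    """Mean cross-entropy over a batch."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# Equal scores for all six classes: the loss is log(6), about 1.79.
uniform_loss = batch_loss([[0.0] * 6], [0])

# A confident, correct prediction gives a much smaller loss.
confident_loss = batch_loss([[5.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [0])

print(round(uniform_loss, 3), round(confident_loss, 3))
```

A training loss close to log(6) therefore means the classifier is doing no better than guessing uniformly among the six classes.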


## Creating Data Loaders

In [164]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# DataLoader: PyTorch utility that loads data in batches for training or evaluation.
#             It supports shuffling, batching, and parallel loading with multiple workers.
# RandomSampler: Draws elements from the dataset in random order.
#                Typically used for training, so the model sees the data in a different order each epoch.
# SequentialSampler: Draws elements in the order in which they appear in the dataset.
#                    Commonly used for validation or evaluation, so the data is processed in a consistent order.

In [165]:
batch_size = 32

# RandomSampler reshuffles the training data every epoch, so the model
# learns general features rather than the order of the examples.
dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

# SequentialSampler visits the validation data in the same order on every
# evaluation pass, which keeps validation results consistent.
dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)
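With a batch size of 32 and the dataset sizes reported above (1258 training and 223 validation examples), the number of batches per epoch follows from a ceiling division, because a `DataLoader` keeps the final, smaller batch by default. A short arithmetic check in plain Python:

```python
import math

batch_size = 32
n_train, n_val = 1258, 223  # len(dataset_train), len(dataset_val)

train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

# Size of the final, partially filled batch in each loader.
last_train_batch = n_train - (train_batches - 1) * batch_size
last_val_batch = n_val - (val_batches - 1) * batch_size

print(train_batches, last_train_batch)  # 40 10
print(val_batches, last_val_batch)      # 7 31
```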

## Setting Up Optimizer and Scheduler

In [167]:
from torch.optim import AdamW   # transformers' own AdamW is deprecated in favor of the PyTorch one
from transformers import get_linear_schedule_with_warmup

In [168]:
# AdamW (Adam with decoupled weight decay) is the usual optimizer for
# fine-tuning Transformer models such as BERT, GPT, and T5.

optimizer = AdamW(model.parameters(), # The model's parameters (weights and biases)
                  lr=1e-5, # Learning rate
                  eps=1e-8, # Epsilon (for numerical stability)
                  weight_decay=0.0) # Default of the deprecated transformers AdamW (torch's default is 0.01)



In [169]:
epochs = 3

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
# get_linear_schedule_with_warmup adjusts the learning rate (LR) after every optimizer step.
# The LR first rises linearly from 0 to the optimizer's initial LR over num_warmup_steps,
# then decays linearly to 0 by num_training_steps.
# With num_warmup_steps=0 there is no warm-up phase: the LR starts at 1e-5 and only decays.
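The shape of a linear schedule with warm-up can be written down in a few lines. The sketch below reproduces the multiplier applied to the base learning rate at each step, in plain Python. `linear_schedule_factor` is an illustrative name, and the step count assumes 40 batches per epoch (1258 training examples at batch size 32) over 3 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base LR: linear warm-up, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 40 * 3  # batches per epoch times epochs

# No warm-up, as configured in this notebook: the LR only decays.
for step in (0, 60, 120):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))

# For comparison, a hypothetical 12-step warm-up: the LR is halfway up at step 6.
print(linear_schedule_factor(6, 12, total_steps))  # 0.5
```

With no warm-up, the learning rate is 1e-5 at the first step, half of that midway through training, and 0 at the final step.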

## Defining our Performance Metrics

The accuracy metric is adapted from the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [172]:
import numpy as np

In [173]:
from sklearn.metrics import f1_score

In [174]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [175]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
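To show what `average='weighted'` means, the sketch below computes a weighted F1 score by hand on a tiny made-up example: each class gets its own F1 score, and the scores are averaged with weights proportional to the number of true examples per class. The helper names and the toy labels are illustrative, and the code is a plain-Python sketch rather than scikit-learn's implementation.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(n * f1_for_class(y_true, y_pred, c) for c, n in support.items()) / total


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

f1_class0 = f1_for_class(y_true, y_pred, 0)  # 0.75
f1_class1 = f1_for_class(y_true, y_pred, 1)  # 0.5
score = weighted_f1(y_true, y_pred)          # (4 * 0.75 + 2 * 0.5) / 6

print(f1_class0, f1_class1, round(score, 4))
```

Because the weights follow class frequency, a frequent class dominates the weighted score, which is why the per-class accuracy above is a useful complement on an imbalanced dataset.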

## Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [178]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [179]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cpu


In [180]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [181]:
# This code implements a training loop for fine-tuning BERT using PyTorch and the transformers library. It trains the model for multiple epochs, 
# calculates the loss, updates the parameters using an optimizer, and adjusts the learning rate using a scheduler.

for epoch in tqdm(range(1, epochs+1)): # Iterate over the epochs; tqdm shows a progress bar.
    
    model.train()  # Switch the model to training mode (enables dropout).
    
    loss_train_total = 0  #Stores the total loss for the epoch.

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar: #Loops through batches of training data.

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)  #Moves input_ids, attention_mask, and labels to GPU or CPU.
        
        inputs = {'input_ids':      batch[0],   #Define Input Dictionary for BERT. These inputs are passed to BERT for processing.
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)  #Forward propagation: The model makes predictions.
        
        loss = outputs[0]     # First output element: the cross-entropy loss computed inside BertForSequenceClassification.
        loss_train_total += loss.item()   # Accumulate the training loss for the epoch.
        loss.backward()  # Compute gradients by backpropagation.

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Gradient clipping: rescales the gradients so their combined norm is at most 1.0, which guards against exploding gradients.

        optimizer.step()  #Gradient Descent Update: Adjusts model weights using AdamW optimizer.
        scheduler.step()  #Adjusts learning rate using get_linear_schedule_with_warmup().
        
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})  # Show the current batch's mean loss in the progress bar.
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')  # Save Model Checkpoint. Saves the model’s weights after each epoch.
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)          #Computes average training loss for the epoch.   
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation) # Evaluate on the validation set: returns the validation loss, the predictions (logits), and the true labels.
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}') #Displays Validation Loss and F1 Score.
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

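The training loop calls `clip_grad_norm_` with a maximum norm of 1.0. The idea behind norm-based clipping fits in a few lines of plain Python: compute the norm of all gradients taken together and, if it exceeds the limit, scale every gradient by the same factor. `clip_by_global_norm` is a hypothetical helper for illustration that works on a flat list of numbers instead of model parameters.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm


# Norm is 5.0, so the gradients are scaled down to a norm of about 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Norm is 0.5, already below the limit, so nothing changes.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, round(norm_after, 4))
print(unchanged)
```

Scaling all gradients by one common factor preserves the direction of the update and only shortens the step.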

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

In [None]:
model.load_state_dict(torch.load('Models/finetuned_bert_epoch_1_gpu_trained.model', map_location=torch.device('cpu')))

In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [None]:
accuracy_per_class(predictions, true_vals)

## Summary of the Training Process

1. **Load data**: mini-batches from `dataloader_train`
2. **Forward pass**: the model predicts labels
3. **Compute loss**: using cross-entropy loss
4. **Backpropagation**: compute gradients
5. **Clip gradients**: prevents exploding gradients
6. **Update weights**: using the AdamW optimizer
7. **Adjust learning rate**: using the scheduler
8. **Evaluate on validation set**: compute validation loss and F1 score