## Introduction

The dataset was taken from https://huggingface.co/datasets/emotion. Emotions in text messages are classified with BERT, a model that is widely used for NLP tasks.

If the ggplot charts do not render on GitHub, [check this notebook on Kaggle](https://www.kaggle.com/code/xyinspired/bert-sentiment-emotions-analysis).

## Preparation

In [1]:
!pip install transformers lets_plot -q

In [2]:
from lets_plot import *
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, DistilBertTokenizer, BertForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score

import matplotlib.pyplot as plt 
import pytorch_lightning as pl
import pandas as pd
import numpy as np
import os
import pickle
import time
import torch.nn as nn
import torch
import warnings

warnings.filterwarnings('ignore')

In [3]:
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print(f'Using GPU : {torch.cuda.get_device_name(0)}')
else:
    device = torch.device("cpu")
    print(f'Using CPU')

Using GPU : Tesla P100-PCIE-16GB


In [4]:
seed = 42

In [5]:
LetsPlot.setup_html()

## Data Loading & Preprocessing

In [6]:
path_to_data_pkl = '/kaggle/input/emotionsdata/merged_training.pkl'

In [7]:
with open(path_to_data_pkl, 'rb') as file:
    data = pickle.load(file)
    print(f'Got data of shape : {data.shape}')

Got data of shape : (416809, 2)


In [8]:
data.rename(columns={'emotions' : 'label'}, inplace=True)
data.head()

| | text | label |
|---|---|---|
| 27383 | i feel awful about it too because it s my job ... | sadness |
| 110083 | im alone i feel awful | sadness |
| 140764 | ive probably mentioned this before but i reall... | joy |
| 100071 | i was feeling a little low few days back | sadness |
| 2837 | i beleive that i am much more sensitive to oth... | love |


In [9]:
data.isna().sum()

text     0
label    0
dtype: int64

In [10]:
frequency = data.label.value_counts()
frequency

joy         141067
sadness     121187
anger        57317
fear         47712
love         34554
surprise     14972
Name: label, dtype: int64

In [11]:
frequency = pd.DataFrame({
    'Labels' : frequency.index,
    'Total' : frequency.values
})

In [12]:
ggplot(frequency, aes(x=frequency.Labels, weight=frequency.Total, fill=frequency.Labels)) + \
    geom_bar() + labs(x='Label', y='Times Occurred')

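As we can see, the dataset contains about 417k text messages, each labeled with one of six emotions: `joy`, `sadness`, `anger`, `fear`, `love`, `surprise`. The classes are imbalanced: `joy` and `sadness` together make up about 63% of the data, while `surprise` accounts for under 4%.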
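
To make the imbalance concrete, here is a small sketch that turns the counts above into class shares and inverse-frequency weights. The notebook itself does not use class weights; the `weights` formula, `n_samples / (n_classes * count)`, is shown only as an illustration of how rare classes could be up-weighted.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "joy": 141067,
    "sadness": 121187,
    "anger": 57317,
    "fear": 47712,
    "love": 34554,
    "surprise": 14972,
}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the whole dataset.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency ("balanced") weights: rare classes get weights above 1.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:<9} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

With weights like these, a macro-averaged metric such as the macro F1 used later is the more informative number to watch, since plain accuracy is dominated by `joy` and `sadness`.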

In [13]:
# Ids are kept as strings here; they are cast to integers inside the model steps.
id_to_label = {'0' : 'joy', '1' : 'sadness', '2' : 'anger', '3' : 'fear', '4' : 'love', '5' : 'surprise'}
label_to_id = {v : k for k, v in id_to_label.items()}

We'll map each label to its id.

In [14]:
data.label = data.label.map(label_to_id)

In [15]:
data.head()

| | text | label |
|---|---|---|
| 27383 | i feel awful about it too because it s my job ... | 1 |
| 110083 | im alone i feel awful | 1 |
| 140764 | ive probably mentioned this before but i reall... | 0 |
| 100071 | i was feeling a little low few days back | 1 |
| 2837 | i beleive that i am much more sensitive to oth... | 4 |


Next, we split the data into `train`, `val`, and `test` sets and wrap each split in a custom Dataset for use with a DataLoader.

In [16]:
TEST_SIZE = 0.2
VAL_SIZE = 0.1
TRAIN_SIZE = 1 - (VAL_SIZE + TEST_SIZE)

In [17]:
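# Split off the test set first (20% of all data).
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, test_size=TEST_SIZE, random_state=seed)
# Note: test_size below is relative to the remaining 80%, so validation is about 8% of all data and train about 72%.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=VAL_SIZE, random_state=seed)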

In [18]:
class SentimentTextDataset(Dataset):
    def __init__(self, text, labels):
        self.text = text
        self.labels = labels
        
    def __len__(self):
        assert len(self.text) == len(self.labels)
        return len(self.labels)
    
    def __getitem__(self, index):
        text_msg = self.text.iloc[index]
        label = self.labels.iloc[index]
        return text_msg, label

In [19]:
train_dataset = SentimentTextDataset(X_train, y_train)
val_dataset = SentimentTextDataset(X_val, y_val)
test_dataset = SentimentTextDataset(X_test, y_test)

In [20]:
print(f'Train set size: {len(train_dataset)} - {int(TRAIN_SIZE * 100)}% of all data')
print(f'Validation set size : {len(val_dataset)} - {int(VAL_SIZE * 100)}% of all data')
print(f'Test set size : {len(test_dataset)} - {int(TEST_SIZE * 100)}% of all data')

Train set size: 300102 - 70% of all data
Validation set size : 33345 - 10% of all data
Test set size : 83362 - 20% of all data
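
Note that the second `train_test_split` call takes its fraction from what is left after the test split. The validation set is therefore 10% of the remaining 80%, i.e. about 8% of all data, and the train set is about 72%, not the 10% and 70% printed above. Here is a quick sketch to check the arithmetic, assuming fractional sizes are rounded up (which agrees with the recorded set sizes). The `two_stage_split` helper is hypothetical and not part of scikit-learn.

```python
import math


def two_stage_split(n_samples, test_size, val_size):
    """Sizes obtained by splitting off test first, then validation from the rest."""
    n_test = math.ceil(test_size * n_samples)  # fractional sizes are rounded up
    n_rest = n_samples - n_test
    n_val = math.ceil(val_size * n_rest)       # fraction of the remainder, not of the whole
    n_train = n_rest - n_val
    return n_train, n_val, n_test


n = 416809  # number of rows in the dataset
n_train, n_val, n_test = two_stage_split(n, test_size=0.2, val_size=0.1)
print(n_train, n_val, n_test)
print(f"train {n_train / n:.1%}, val {n_val / n:.1%}, test {n_test / n:.1%}")

# Possible adjustment (an assumption, not what the notebook does):
# rescale the validation fraction so that it is about 10% of all data.
rescaled_val = 0.1 / (1 - 0.2)
print(two_stage_split(n, test_size=0.2, val_size=rescaled_val))
```

The first call reproduces the sizes reported above. The second shows what a true 70/10/20 split would look like.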


We do not need many training epochs, since we are only fine-tuning a model that was already pretrained on a large corpus. Training one epoch takes 2-3 hours on the GPU, and with more epochs the model starts to overfit quickly.

In [21]:
BATCH_SIZE = 48
LR = 4e-5
EPOCH = 2
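
As a rough sanity check on the training cost, the number of optimizer steps follows directly from these hyperparameters and the train set size. This is a back-of-the-envelope sketch that assumes the last, smaller batch is kept (the `DataLoader` default).

```python
import math

n_train = 300102  # train set size printed above
batch_size = 48   # BATCH_SIZE
epochs = 2        # EPOCH

# The last batch is smaller than batch_size but still counts as a step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```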

In [22]:
train_data_loader = DataLoader(train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=8)

val_data_loader = DataLoader(val_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=False,
                              num_workers=8)

test_data_loader = DataLoader(test_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=False,
                              num_workers=8)
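
To see what the loaders hand to the model, here is a simplified, stdlib-only sketch of batching over a map-style dataset. `ToyDataset` and `iterate_batches` are hypothetical stand-ins, not the PyTorch classes, and the real `DataLoader` additionally handles worker processes and tensor collation.

```python
import random


class ToyDataset:
    """Map-style dataset: only __len__ and __getitem__ are required."""

    def __init__(self, texts, labels):
        assert len(texts) == len(labels)
        self.texts, self.labels = texts, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.texts[index], self.labels[index]


def iterate_batches(dataset, batch_size, shuffle=False, seed=42):
    """Yield (texts, labels) tuples, one per batch."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        texts, labels = zip(*samples)  # transpose a list of pairs into two tuples
        yield texts, labels


toy = ToyDataset(["a", "b", "c", "d", "e"], ["0", "1", "0", "4", "1"])
batches = list(iterate_batches(toy, batch_size=2))
print(batches)
```

Each batch is a pair of sequences (raw strings and string label ids). That is why the model steps below tokenize the text and convert the labels to a tensor themselves.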

## BERT

We are going to use the BERT transformer model. Text has to be tokenized before it is passed to the model.
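
BERT's tokenizer splits words into sub-word pieces with a greedy longest-match-first rule (WordPiece) and wraps the sequence in `[CLS]` and `[SEP]` tokens. Below is a toy sketch of that idea in plain Python. The vocabulary is invented for the example, and the real `BertTokenizer` also handles punctuation and other normalization that is skipped here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Toy vocabulary, invented for this example only.
toy_vocab = {"i", "was", "feel", "##ing", "a", "little", "low"}
print(encode("I was feeling a little low", toy_vocab))
```

Here `feeling` is not in the toy vocabulary, so it is split into `feel` and `##ing`. A word with no matching pieces is replaced by `[UNK]`.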

### Set-Up

In [23]:
model_name_bert = 'bert-base-uncased'

In [24]:
tokenizer = BertTokenizer.from_pretrained(model_name_bert)

In [25]:
max_len = np.zeros(len(data))
for i in range(len(data)):
    input_ids = tokenizer.encode(data.text.iloc[i], add_special_tokens=True)
    max_len[i] = len(input_ids)
print('Max length: ', max_len.max())

del input_ids
del max_len

Max length:  185.0
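
The longest message is 185 tokens, so any `max_len` of at least 185 avoids truncation. Shorter sequences are padded up to `max_len`, and an attention mask marks which positions hold real tokens. Here is a minimal sketch of that step. `pad_or_truncate` is a hypothetical helper, and the token ids in the example are placeholders apart from the special ids noted in the comments.

```python
def pad_or_truncate(input_ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both exactly max_len long."""
    ids = list(input_ids[:max_len])  # naive truncation; a real tokenizer keeps [SEP] at the end
    mask = [1] * len(ids)            # 1 = real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad          # pad on the right
    mask += [0] * n_pad              # 0 = padding, ignored by attention
    return ids, mask


# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased, 0 is [PAD];
# the ids in between are placeholders for illustration.
example = [101, 1045, 2514, 9643, 102]
ids, mask = pad_or_truncate(example, max_len=8)
print(ids)
print(mask)
```

With the `max_len=256` passed to the model below, every batch is padded well beyond the longest message, by at least 71 positions. That is safe, but it costs some extra compute.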


We also wrap the model in a PyTorch Lightning `LightningModule`, which defines the training, validation, and prediction steps.

In [26]:
class BertModel(pl.LightningModule):
    def __init__(self, tokenizer, max_len, lr=LR):
        super().__init__()
        self.model = BertForSequenceClassification.from_pretrained(model_name_bert, num_labels=data.label.nunique())        
        self.loss = nn.functional.cross_entropy
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.softmax = nn.Softmax(dim=1)  # normalize over the class dimension
        self.lr = lr
        self.losses_train = []
        self.losses_val = []
        self.acc_train = []
        self.acc_val = []
        self.f1_train = []
        self.f1_val = []
        
    def forward(self, x):
        encoding = self.tokenizer(
          x,
          add_special_tokens=True,
          max_length=self.max_len,
          return_token_type_ids=False,
          padding='max_length',  # replaces the deprecated pad_to_max_length=True
          return_attention_mask=True,
          return_tensors='pt',
          truncation=True
        ).to(device)
        
        return self.model(**encoding)
    
    def training_step(self, batch, batch_idx):
        text_msg, y_true = batch
        y_true = torch.tensor(np.array(list(y_true)).astype(np.uint8)).to(device)
        logits = self.forward(text_msg).logits
        loss = self.loss(logits, y_true)
        softmax = self.softmax(logits)
        y_pred = torch.argmax(softmax, dim=1).to(device)
        
        accuracy = torch.tensor(accuracy_score(y_true.cpu().numpy(), y_pred.cpu().numpy()))
        f1 = torch.tensor(f1_score(y_true.cpu().numpy(), y_pred.cpu().numpy(), average="macro"))
        self.losses_train.append(loss)
        self.acc_train.append(accuracy)
        self.f1_train.append(f1)
        return {'loss' : loss, 'accuracy' : accuracy, 'f1' : f1}
    
    def validation_step(self, batch, batch_idx):
        text_msg, y_true = batch
        y_true = torch.tensor(np.array(list(y_true)).astype(np.uint8)).to(device)
        logits = self.forward(text_msg).logits
        loss = self.loss(logits, y_true)
        softmax = self.softmax(logits)
        y_pred = torch.argmax(softmax, dim=1).to(device)
        
        accuracy = torch.tensor(accuracy_score(y_true.cpu().numpy(), y_pred.cpu().numpy()))
        f1 = torch.tensor(f1_score(y_true.cpu().numpy(), y_pred.cpu().numpy(), average="macro"))
        self.losses_val.append(loss)
        self.acc_val.append(accuracy)
        self.f1_val.append(f1)
        return {'val_loss' : loss, 'val_accuracy' : accuracy, 'val_f1' : f1}
    
    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['accuracy'] for x in outputs]).mean()
        avg_f1 = torch.stack([x['f1'] for x in outputs]).mean()
        
        print(f"Train_loss: {avg_loss:.2f}")
        print(f"Train_accuracy: {avg_acc:.2f}")
        print(f"Train_f1: {avg_f1:.2f}")
        
        self.log('loss', avg_loss, prog_bar=True, on_epoch=True, on_step=False)
        
    def predict_step(self, batch, batch_idx):
        if isinstance(batch, list):
            if len(batch) > 1:
                text_msg, _ = batch
        else:
            text_msg = batch
        output = self.forward(text_msg).logits.to(device)
        probs =  self.softmax(output).to(device)
        return torch.argmax(probs, dim=1)
        
    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['val_accuracy'] for x in outputs]).mean()
        avg_f1 = torch.stack([x['val_f1'] for x in outputs]).mean()
        
        print(f"Val_loss: {avg_loss:.2f}", end= " ")
        print(f"Val_accuracy: {avg_acc:.2f}", end= " ")
        print(f"Val_f1 {avg_f1:.2f}", end= " ")
        
        self.log('val_loss', avg_loss, prog_bar=True, on_epoch=True, on_step=False)
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr, weight_decay=1e-6)
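
The training and validation steps turn logits into a loss and a predicted class. Here is a plain-Python sketch of those two computations for a single sample, with made-up logits. It also shows that applying softmax before `argmax` does not change the prediction, because softmax preserves the ordering of the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the true class for one sample."""
    return -math.log(softmax(logits)[target])


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 0.5, -1.0, 0.1, 0.0, -0.3]  # made-up scores for the six classes
probs = softmax(logits)

# Softmax is monotonic, so the predicted class is the same with or without it.
assert argmax(logits) == argmax(probs)
print(argmax(probs), round(cross_entropy(logits, target=0), 3))

# A model that assigns equal probability to all six classes has loss ln(6).
print(round(cross_entropy([0.0] * 6, target=0), 2))
```

The uniform-guess loss of ln(6), about 1.79, is a useful reference point: the sanity-check validation loss of 1.77 logged before training is close to it.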

### Training

In [27]:
bert_model = BertModel(tokenizer, max_len=256)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [28]:
model_check_point = ModelCheckpoint(dirpath='runs/bert_emotion',
                                    filename='{epoch}-{val_loss:.3f}',
                                    monitor='val_loss', 
                                    mode='min', 
                                    save_top_k=1)

In [29]:
trainer = pl.Trainer(
    max_epochs=EPOCH,
    gpus=1,
    callbacks=[model_check_point],
    log_every_n_steps=5
)

print(f'Training started...')
train_start = time.time()

trainer.fit(bert_model, train_data_loader, val_data_loader)

train_finish = time.time()
print(f'Training finished after {((train_finish - train_start) / 60):.1f} minutes')

Training started...


Sanity Checking: 0it [00:00, ?it/s]

Val_loss: 1.77 Val_accuracy: 0.28 Val_f1 0.10 

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Val_loss: 0.09 Val_accuracy: 0.94 Val_f1 0.90 Train_loss: 0.13
Train_accuracy: 0.93
Train_f1: 0.88


Validation: 0it [00:00, ?it/s]

Val_loss: 0.09 Val_accuracy: 0.94 Val_f1 0.90 Train_loss: 0.09
Train_accuracy: 0.94
Train_f1: 0.89
Training finished after 244.8 minutes


In [30]:
train_step_losses = [i.detach().cpu().item() for i in bert_model.losses_train]
val_step_losses = [i.detach().cpu().item() for i in bert_model.losses_val]

train_accuracy_scores = [i.detach().cpu().item() for i in bert_model.acc_train]
val_accuracy_scores = [i.detach().cpu().item() for i in bert_model.acc_val]

train_f1_scores = [i.detach().cpu().item() for i in bert_model.f1_train]
val_f1_scores = [i.detach().cpu().item() for i in bert_model.f1_val]

train_steps = [i for i in range(len(train_step_losses))]
val_steps = [i for i in range(len(val_step_losses))]

loss_stats_train = pd.DataFrame({
    'step' : train_steps,
    'train_loss' : train_step_losses,
})

loss_stats_val = pd.DataFrame({
    'step' : val_steps,
    'val_loss' : val_step_losses
})

acc_stats_train = pd.DataFrame({
    'step' : train_steps,
    'train_acc' : train_accuracy_scores,
})

acc_stats_val = pd.DataFrame({
    'step' : val_steps,
    'val_acc' : val_accuracy_scores 
})

f1_stats_train = pd.DataFrame({
    'step' : train_steps,
    'train_f1' : train_f1_scores,
})

f1_stats_val = pd.DataFrame({
    'step' : val_steps,
    'val_f1' : val_f1_scores
})

In [31]:
bunch = GGBunch()
plot = ggplot(loss_stats_train) + geom_path(aes('step', 'train_loss'), size=1.3, color='blue') + ggsize(500, 400) + ggtitle('Train Loss')
bunch.add_plot(plot, 100, 0)
plot = ggplot(loss_stats_val) + geom_path(aes('step', 'val_loss'), size=1.3, color='red') + ggsize(500, 400) + ggtitle('Validation Loss')
bunch.add_plot(plot, 700, 0)
bunch.show()

In [32]:
bunch = GGBunch()
plot = ggplot(acc_stats_train) + geom_path(aes('step', 'train_acc'), size=1.3, color='blue') + ggsize(500, 400) + ggtitle('Train Accuracy')
bunch.add_plot(plot, 100, 0)
plot = ggplot(acc_stats_val) + geom_path(aes('step', 'val_acc'), size=1.3, color='red') + ggsize(500, 400) + ggtitle('Validation Accuracy')
bunch.add_plot(plot, 700, 0)
bunch.show()

In [33]:
bunch = GGBunch()
plot = ggplot(f1_stats_train) + geom_path(aes('step', 'train_f1'), size=1.3, color='blue') + ggsize(500, 400) + ggtitle('Train F1-score')
bunch.add_plot(plot, 100, 0)
plot = ggplot(f1_stats_val) + geom_path(aes('step', 'val_f1'), size=1.3, color='red') + ggsize(500, 400) + ggtitle('Validation F1-score')
bunch.add_plot(plot, 700, 0)
bunch.show()

### Inference

Let's check the model's performance on the held-out test data, which it has not seen during training.

In [34]:
y_hat_test = trainer.predict(bert_model, test_data_loader)
y_hat_test = np.array(torch.cat(y_hat_test)).astype(np.uint8)

Predicting: 6253it [00:00, ?it/s]

In [35]:
y_test = y_test.to_numpy().astype(np.uint8)

In [36]:
test_accuracy = accuracy_score(y_test, y_hat_test)
test_f1 = f1_score(y_test, y_hat_test, average='macro')

In [37]:
print(f'Accuracy : {test_accuracy}')
print(f'F1 Score : {test_f1}')

Accuracy : 0.941088265636621
F1 Score : 0.9053555194853412
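
The macro F1 score is lower than the accuracy because it averages per-class F1 scores without weighting by class size, so errors on rare classes count as much as errors on frequent ones. Here is a small plain-Python sketch with toy labels (not the notebook's predictions) that shows the effect. `macro_f1` and `accuracy` are written out by hand here and are not the scikit-learn functions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: 8 samples of a frequent class 0 and 2 samples of a rare class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one rare-class sample is missed

print(f"accuracy={accuracy(y_true, y_pred):.3f} macro_f1={macro_f1(y_true, y_pred):.3f}")
```

A single mistake on the rare class costs 10 points of accuracy but pulls the macro F1 down to about 0.80, since the rare class's F1 drops to 2/3.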


As we can see, the model reaches about 94% accuracy and a macro F1 score of about 0.91 on the test set, beating the baseline model's performance.