# Emotion analysis with BERT

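This notebook fine-tunes the distilled BERT base model (`distilbert-base-uncased`) with the transformers library on the EmoBank dataset, to perform emotion analysis based on the circumplex model. The model predicts a valence (V), arousal (A) and dominance (D) score for each sentence.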

Written by Luc Bijl.

Loading the EmoBank dataset from the datasets directory and splitting it into a training and a test set based on the `Split` column.

In [1]:
import os
import pandas as pd

emobank_dataset = "../../datasets/emobank/emobank.csv"

df_emobank = pd.read_csv(emobank_dataset, header=None, names=['Id', 'Split', 'V', 'A', 'D', 'Text'])

df_raw_train = df_emobank[df_emobank['Split'] == 'train'].drop(columns=['Id', 'Split']).reset_index(drop=True)
df_raw_test = df_emobank[df_emobank['Split'] == 'test'].drop(columns=['Id', 'Split']).reset_index(drop=True)

print("Train Data:")
print(df_raw_train.head())
print("\nTest Data:")
print(df_raw_test.head())

Train Data:
      V     A     D                                               Text
0   3.0   3.0   3.2        Remember what she said in my last letter? "
1   3.0   3.0   3.0                                                .."
2  3.44   3.0  3.22  Goodwill helps people get off of public assist...
3  3.55  3.27  3.46  Sherry learned through our Future Works class ...
4   3.6   3.3   3.8  Coming to Goodwill was the first step toward m...

Test Data:
      V     A     D                                               Text
0   2.8   3.1   2.8                          If I wasn't working here.
1  3.27  3.36  3.36      I've got more than a job; I've got a career."
2  2.86  3.29  3.29                           He has no time to waste.
3   3.4   3.1   3.4  With the help of friends like you, Goodwill ha...
4   3.0   2.6   3.1                                      Real results.
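The scores in the output above are printed with varying numbers of decimals, which suggests they are stored as strings. A likely cause, assuming the CSV file starts with its own header row (the file itself is not shown here), is that `header=None` makes pandas read that row as data. The sketch below reproduces the effect on a small made-up in-memory CSV; `sample_csv` and its rows are purely illustrative.

```python
import io
import pandas as pd

# Made-up sample in the assumed layout: a header row followed by data rows.
sample_csv = io.StringIO(
    "id,split,V,A,D,text\n"
    "a1,train,3.0,3.0,3.2,First sentence.\n"
    "a2,test,2.8,3.1,2.8,Second sentence.\n"
)

columns = ['Id', 'Split', 'V', 'A', 'D', 'Text']
df = pd.read_csv(sample_csv, header=None, names=columns)

# The file's own header line is now row 0, so 'V' holds strings, not floats.
print(len(df))                                  # 3 rows: header line + 2 data rows
print(pd.api.types.is_numeric_dtype(df['V']))   # False

# Filtering on the split drops the header row; an explicit cast restores numbers.
train = df[df['Split'] == 'train'].reset_index(drop=True)
v_train = train['V'].astype(float)
print(v_train.tolist())                         # [3.0]
```

This would explain why the scores are cast with `.astype(float)` before they are normalized.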


Normalizing the scores of the training and test sets with (2n - 5) / 5, which maps the interval from 0 to 5 onto the range -1 to 1.

In [2]:
def normalize(n):
    normal_n = (2*n - 5) / 5
    return normal_n

df_train = df_raw_train[['Text']].copy()
df_test = df_raw_test[['Text']].copy()

for i in ['V','A','D']:
    df_train[i] = normalize(df_raw_train[i].astype(float))
    df_test[i] = normalize(df_raw_test[i].astype(float))

print("Train Data:")
print(df_train.head())
print("\nTest Data:")
print(df_test.head())

Train Data:
                                                Text      V      A      D
0        Remember what she said in my last letter? "  0.200  0.200  0.280
1                                                .."  0.200  0.200  0.200
2  Goodwill helps people get off of public assist...  0.376  0.200  0.288
3  Sherry learned through our Future Works class ...  0.420  0.308  0.384
4  Coming to Goodwill was the first step toward m...  0.440  0.320  0.520

Test Data:
                                                Text      V      A      D
0                          If I wasn't working here.  0.120  0.240  0.120
1      I've got more than a job; I've got a career."  0.308  0.344  0.344
2                           He has no time to waste.  0.144  0.316  0.316
3  With the help of friends like you, Goodwill ha...  0.360  0.240  0.360
4                                      Real results.  0.200  0.040  0.240
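Assuming the EmoBank ratings lie on a 1 to 5 scale, the formula used here never reaches -1: the lowest rating maps to -0.6 and the neutral rating 3 maps to 0.2, as the first rows above show. The sketch below compares it with a hypothetical alternative, `normalize_centered`, which is not part of this notebook and maps 1 to 5 symmetrically onto -1 to 1.

```python
def normalize(n):
    # Same formula as in the cell above.
    return (2 * n - 5) / 5

def normalize_centered(n, low=1.0, high=5.0):
    # Hypothetical alternative: map [low, high] linearly onto [-1, 1].
    return (2 * n - (high + low)) / (high - low)

for rating in (1.0, 3.0, 5.0):
    print(rating, normalize(rating), normalize_centered(rating))
```

Both mappings are linear, so correlation-based metrics are unaffected by the choice; only the offset and scale of the reported errors differ.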


Printing the most extreme sentences in the training set for each of the three dimensions.

In [3]:
for i in ['V','A','D']:
    # idxmin/idxmax return index labels, which is what .loc expects
    print("Min {}:\n{}".format(i, df_train.loc[df_train[i].idxmin()]))
    print()
    print("Max {}:\n{}".format(i, df_train.loc[df_train[i].idxmax()]))
    print()
    print()

Min V:
Text    "Fuck you"
V            -0.52
A             0.68
D             0.52
Name: 930, dtype: object

Max V:
Text    lol Wonderful Simply Superb!
V                               0.84
A                               0.72
D                               0.48
Name: 7695, dtype: object


Min A:
Text    I was feeling calm and private that night.
V                                             0.24
A                                            -0.28
D                                             0.24
Name: 2859, dtype: object

Max A:
Text    "My God, yes, yes, yes!"
V                           0.72
A                           0.76
D                           0.36
Name: 6270, dtype: object


Min D:
Text    Hands closed on my neck and I felt my spine cr...
V                                                   -0.24
A                                                    0.52
D                                                    -0.2
Name: 3373, dtype: object

Max D:
Text    “NO”
V      -0.32
A   

Determining the length of the training and testing dataset, to set a proper batch size.

In [4]:
print(f"Length training set: {len(df_train)}\nLength testing set: {len(df_test)}")

Length training set: 8062
Length testing set: 1000
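One way to pick a batch size from these lengths is to look for divisors, so that every batch has the same size and none is left partially filled. The small helper below (`divisors` is an illustrative name, not part of the notebook) lists the candidates and checks the batch sizes 29 and 20 that are used for the data loaders further on.

```python
def divisors(n):
    # All positive integers that divide n without remainder.
    return [d for d in range(1, n + 1) if n % d == 0]

train_size, test_size = 8062, 1000

print(divisors(train_size))  # candidate batch sizes for the training set
print(divisors(test_size))   # candidate batch sizes for the test set

# Batch sizes used for the data loaders: 29 for training, 20 for testing.
print(train_size // 29, train_size % 29)  # 278 batches, remainder 0
print(test_size // 20, test_size % 20)    # 50 batches, remainder 0
```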


Checking whether PyTorch can access a CUDA-capable GPU.

In [5]:
import torch

torch.cuda.is_available()

True
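The output `True` only confirms that a CUDA device is visible. The cells in this notebook do not appear to move the model or the batches to that device, so as written the computation would stay on the CPU. A minimal sketch of the usual pattern follows, with a small `torch.nn.Linear` as a stand-in for the real model; `device`, `toy_model` and `toy_batch` are illustrative names.

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the real model: 4 input features, 3 outputs (V, A, D).
toy_model = torch.nn.Linear(4, 3).to(device)

# Each batch has to be moved to the same device before the forward pass.
toy_batch = torch.randn(2, 4).to(device)
toy_output = toy_model(toy_batch)

print(toy_output.device.type, toy_output.shape)
```

In the training loop this would amount to calling `.to(device)` on the model once and on `input_ids`, `attention_mask` and `target_scores` for every batch.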

Preparing the data for DistilBERT: tokenizing and encoding the text, and creating data loaders for both the training and the test set.

In [6]:
from transformers import DistilBertTokenizerFast
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

# Tokenizing and encoding the text data
train_encodings = tokenizer(df_train['Text'].tolist(), truncation=True, padding=True, return_tensors='pt')
test_encodings = tokenizer(df_test['Text'].tolist(), truncation=True, padding=True, return_tensors='pt')

# Creating data loaders
train_dataset = torch.utils.data.TensorDataset(
    train_encodings['input_ids'], 
    train_encodings['attention_mask'], 
    torch.tensor(df_train[['V', 'A', 'D']].values, dtype=torch.float32)
)
train_dataloader = DataLoader(train_dataset, batch_size=29, shuffle=True)

test_dataset = torch.utils.data.TensorDataset(
    test_encodings['input_ids'], 
    test_encodings['attention_mask'], 
    torch.tensor(df_test[['V', 'A', 'D']].values, dtype=torch.float32)
)
test_dataloader = DataLoader(test_dataset, batch_size=20, shuffle=False)
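Because the tokenizer is called once on the full list of sentences with `padding=True`, every sequence is padded to the length of the longest sentence in that split, and the attention mask marks which positions hold real tokens. The toy function below imitates this behaviour in plain Python; `pad_batch`, the padding id and the token ids are made up for illustration and are not tokenizer output.

```python
PAD_ID = 0  # assumed padding id for this illustration

def pad_batch(sequences, pad_id=PAD_ID):
    # Pad every sequence to the length of the longest one and build the mask.
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Made-up token ids, not real tokenizer output.
ids, mask = pad_batch([[101, 2613, 3463, 102], [101, 102]])
print(ids)   # [[101, 2613, 3463, 102], [101, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding to the longest sentence of the whole split keeps the tensors rectangular, which is what allows them to be wrapped in a `TensorDataset`, at the cost of some wasted computation on short sentences.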



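Defining the model: DistilBERT with a sequence classification head that has three outputs, one for each of the dimensions V, A and D.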

In [7]:
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Defining the optimizer and loss function.

In [8]:
from torch.optim import Adam
from torch.nn import L1Loss

optimizer = Adam(model.parameters(), lr=1e-5)
loss_fn = L1Loss()
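`L1Loss` with its default reduction returns the mean absolute error. The training loop applies it to each dimension separately and sums the three results. Since every dimension has the same number of entries per batch, that sum equals three times the L1 loss over the full prediction matrix. A small check with made-up tensors:

```python
import torch
from torch.nn import L1Loss

loss_fn = L1Loss()  # default reduction='mean'

# Made-up predictions and targets: 2 sentences, 3 dimensions (V, A, D).
predicted = torch.tensor([[0.3, 0.1, 0.2], [0.0, 0.4, 0.2]])
target = torch.tensor([[0.2, 0.2, 0.2], [0.2, 0.2, 0.3]])

# Per-dimension mean absolute errors, summed as in the training loop.
per_dimension = [loss_fn(predicted[:, k], target[:, k]) for k in range(3)]
summed = sum(per_dimension)

# The same quantity from a single call on the full matrix.
joint = 3 * loss_fn(predicted, target)

print(summed.item(), joint.item())
```

Keeping the three terms separate does not change the gradient, but it makes it possible to log the loss for each dimension individually.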

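Defining the training loop. The model is trained on the training set and, after every epoch, validated on the test set and saved to disk. Training stops early once the validation loss has not improved for two consecutive epochs.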

In [9]:
from torch.utils.tensorboard import SummaryWriter

log_dir = 'logs-4'
writer = SummaryWriter(log_dir)
global_step = 0

num_epochs = 30

early_stop_patience = 2
best_validation_loss = float('inf')
no_improvement_counter = 0

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    total_loss_v = 0
    total_loss_a = 0
    total_loss_d = 0
    num_batches = 0

    for batch in train_dataloader:
        global_step += 1
        num_batches += 1
        input_ids, attention_mask, target_scores = batch

        # Forward pass
        output = model(input_ids=input_ids, attention_mask=attention_mask)
        predicted_scores = output.logits

        # Calculating the loss for each dimension
        loss_v = loss_fn(predicted_scores[:,0], target_scores[:,0])
        loss_a = loss_fn(predicted_scores[:,1], target_scores[:,1])
        loss_d = loss_fn(predicted_scores[:,2], target_scores[:,2])

        # The main loss is defined as the sum of the individual losses
        loss = loss_v + loss_a + loss_d

        # The total loss per epoch
        total_loss_v += loss_v.item()
        total_loss_a += loss_a.item()
        total_loss_d += loss_d.item()
        total_loss += loss.item()

        # Running average of the loss over the batches seen so far in this epoch
        average_loss_v = total_loss_v / num_batches
        average_loss_a = total_loss_a / num_batches
        average_loss_d = total_loss_d / num_batches
        average_loss = total_loss / num_batches

        # Tensorboard logging
        writer.add_scalar('Batch-loss-train-valence', average_loss_v, global_step)
        writer.add_scalar('Batch-loss-train-arousal', average_loss_a, global_step)
        writer.add_scalar('Batch-loss-train-dominance', average_loss_d, global_step)
        writer.add_scalar('Batch-loss-train', average_loss, global_step)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Determining the average loss for the epoch
    average_loss_v = total_loss_v / len(train_dataloader)
    average_loss_a = total_loss_a / len(train_dataloader)
    average_loss_d = total_loss_d / len(train_dataloader)
    average_loss = total_loss / len(train_dataloader)
    
    # Logging
    writer.add_scalar('Epoch-loss-train-valence', average_loss_v, epoch + 1)
    writer.add_scalar('Epoch-loss-train-arousal', average_loss_a, epoch + 1)
    writer.add_scalar('Epoch-loss-train-dominance', average_loss_d, epoch + 1)
    writer.add_scalar('Epoch-loss-train', average_loss, epoch + 1)
    print(f"Epoch {epoch + 1}/{num_epochs}, Training Loss: {average_loss:.4f}")
    
    # Validation
    model.eval()
    total_loss = 0
    total_loss_v = 0
    total_loss_a = 0
    total_loss_d = 0

    for batch in test_dataloader:
        with torch.no_grad():
            input_ids, attention_mask, target_scores = batch

            # Obtaining the scores
            output = model(input_ids=input_ids, attention_mask=attention_mask)
            predicted_scores = output.logits

            # Calculating the loss for each dimension
            loss_v = loss_fn(predicted_scores[:,0], target_scores[:,0])
            loss_a = loss_fn(predicted_scores[:,1], target_scores[:,1])
            loss_d = loss_fn(predicted_scores[:,2], target_scores[:,2])
            
            # The main loss is defined as the sum of the individual losses
            loss = loss_v + loss_a + loss_d

            # The total loss per epoch
            total_loss_v += loss_v.item()
            total_loss_a += loss_a.item()
            total_loss_d += loss_d.item()
            total_loss += loss.item()

    # Determining the average loss for the epoch
    average_loss_v = total_loss_v / len(test_dataloader)
    average_loss_a = total_loss_a / len(test_dataloader)
    average_loss_d = total_loss_d / len(test_dataloader)
    average_loss = total_loss / len(test_dataloader)

    # Logging  
    writer.add_scalar('Epoch-loss-validation-valence', average_loss_v, epoch + 1)
    writer.add_scalar('Epoch-loss-validation-arousal', average_loss_a, epoch + 1)
    writer.add_scalar('Epoch-loss-validation-dominance', average_loss_d, epoch + 1)
    writer.add_scalar('Epoch-loss-validation', average_loss, epoch + 1)
    print(f"Epoch {epoch + 1}/{num_epochs}, Validation Loss: {average_loss:.4f}\n")

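    # Saving the model (creating the checkpoint directory if it does not exist yet)
    os.makedirs('bert-emobank-4', exist_ok=True)
    torch.save(model, f'bert-emobank-4/{epoch + 1}.pth')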

    # Early stopping check
    if average_loss < best_validation_loss:
        best_validation_loss = average_loss
        no_improvement_counter = 0
    else:
        no_improvement_counter += 1

    if no_improvement_counter >= early_stop_patience:
        break

writer.close()

Epoch 1/30, Training Loss: 0.2365
Epoch 1/30, Validation Loss: 0.1970

Epoch 2/30, Training Loss: 0.1965
Epoch 2/30, Validation Loss: 0.1871

Epoch 3/30, Training Loss: 0.1843
Epoch 3/30, Validation Loss: 0.1854

Epoch 4/30, Training Loss: 0.1754
Epoch 4/30, Validation Loss: 0.1853

Epoch 5/30, Training Loss: 0.1654
Epoch 5/30, Validation Loss: 0.1864

Epoch 6/30, Training Loss: 0.1581
Epoch 6/30, Validation Loss: 0.1851

Epoch 7/30, Training Loss: 0.1513
Epoch 7/30, Validation Loss: 0.1869

Epoch 8/30, Training Loss: 0.1446
Epoch 8/30, Validation Loss: 0.1876
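The log above can be replayed against the early stopping rule to see why training ended after epoch 8. The sketch below reimplements only the patience logic of the loop in plain Python, using the validation losses printed above; `replay_early_stopping` is an illustrative helper.

```python
# Validation losses copied from the training log, epochs 1 to 8.
validation_losses = [0.1970, 0.1871, 0.1854, 0.1853, 0.1864, 0.1851, 0.1869, 0.1876]

def replay_early_stopping(losses, patience=2):
    # Returns (epoch at which training stops, best epoch, best loss).
    best_loss, best_epoch, counter = float('inf'), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
        if counter >= patience:
            return epoch, best_epoch, best_loss
    return len(losses), best_epoch, best_loss

stopped_at, best_epoch, best_loss = replay_early_stopping(validation_losses)
print(stopped_at, best_epoch, best_loss)  # 8 6 0.1851
```

Epoch 5 is worse than epoch 4, but epoch 6 improves again and resets the counter. Epochs 7 and 8 both fail to improve on epoch 6, which triggers the stop.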



Loading the checkpoint saved after epoch 6, the epoch with the lowest validation loss (0.1851).

In [16]:
model = torch.load('bert-emobank-4/6.pth')
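Saving with `torch.save(model, ...)` pickles the whole model object, so loading it depends on the same class definitions being importable, and newer PyTorch releases default to a restricted `weights_only` loading mode that may refuse a fully pickled model. A commonly used alternative is to store only the `state_dict`. The sketch below shows that pattern with a small stand-in model and a temporary file; all names in it are illustrative and it is not what this notebook does.

```python
import os
import tempfile
import torch

# Stand-in for the fine-tuned model.
toy_model = torch.nn.Linear(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pth")

    # Save only the parameters instead of the whole pickled object.
    torch.save(toy_model.state_dict(), path)

    # Rebuild the architecture first, then load the parameters into it.
    restored = torch.nn.Linear(4, 3)
    restored.load_state_dict(torch.load(path))

restored.eval()
print(torch.equal(toy_model.weight, restored.weight))  # True
```

For the model in this notebook, rebuilding the architecture would correspond to calling `DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)` before loading the saved parameters.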

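Evaluating the model on the test set by computing the Pearson correlation coefficient (R), the mean squared error (MSE) and the mean absolute error (MAE) for each dimension.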

In [17]:
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, mean_absolute_error

model.eval()
list_predicted_scores = {'V': [], 'A': [], 'D': []}

for batch in test_dataloader:
    with torch.no_grad():
        input_ids, attention_mask, target_scores = batch

        # Obtaining the scores
        output = model(input_ids=input_ids, attention_mask=attention_mask)
        predicted_scores = output.logits

        # Writing the scores to the list
        list_predicted_scores['V'].extend(predicted_scores[:, 0].tolist())
        list_predicted_scores['A'].extend(predicted_scores[:, 1].tolist())
        list_predicted_scores['D'].extend(predicted_scores[:, 2].tolist())

# Inserting the scores in df_test
for i,j in zip(['V', 'A', 'D'],['V-p', 'A-p', 'D-p']):
    df_test[j] = list_predicted_scores[i]

# Computing the R, MSE and MAE values.
for i,j in zip(['V', 'A', 'D'],['V-p', 'A-p', 'D-p']):

    correlation, _ = pearsonr(df_test[i], df_test[j])

    print(f"Pearson Correlation Coefficient (R) {i}: {correlation:.4f}")
    print(f"Mean Squared Error (MSE) {i}: {mean_squared_error(df_test[i], df_test[j]):.4f}")
    print(f"Mean Absolute Error (MAE) {i}: {mean_absolute_error(df_test[i], df_test[j]):.4f}")
    print()

Pearson Correlation Coefficient (R) V: 0.7740
Mean Squared Error (MSE) V: 0.0077
Mean Absolute Error (MAE) V: 0.0638

Pearson Correlation Coefficient (R) A: 0.5412
Mean Squared Error (MSE) A: 0.0071
Mean Absolute Error (MAE) A: 0.0655

Pearson Correlation Coefficient (R) D: 0.4805
Mean Squared Error (MSE) D: 0.0055
Mean Absolute Error (MAE) D: 0.0558
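For reference, the three metrics can be written out in a few lines of plain Python. The definitions below are a sketch for illustration with made-up values; the notebook itself relies on scipy and scikit-learn. The last lines check that the three reported MAE values add up to the validation loss of epoch 6, which is expected because that loss is the sum of the per-dimension L1 losses over equally sized batches.

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean squared error.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(x, y):
    # Covariance divided by the product of the standard deviations.
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration.
y_true = [0.2, 0.4, -0.1, 0.3]
y_pred = [0.25, 0.3, 0.0, 0.3]
print(mae(y_true, y_pred), mse(y_true, y_pred), pearson_r(y_true, y_pred))

# MAE values reported above, compared with the epoch 6 validation loss.
reported_mae = {'V': 0.0638, 'A': 0.0655, 'D': 0.0558}
print(round(sum(reported_mae.values()), 4))  # 0.1851
```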



Comparing the summary statistics of the annotated scores and the predicted scores (columns with the suffix `-p`).

In [18]:
df_test[['V', 'V-p', 'A', 'A-p', 'D', 'D-p']].describe()

               V        V-p          A        A-p          D        D-p
count     1000.0     1000.0     1000.0     1000.0     1000.0     1000.0
mean    0.191236   0.196919   0.213576   0.210068   0.225812   0.230005
std     0.138137   0.107967   0.098377   0.068097   0.084154   0.042859
min       -0.372  -0.195293      -0.16   0.078601     -0.288   0.064148
25%         0.12   0.149089      0.156   0.167158        0.2   0.207812
50%          0.2   0.210195        0.2   0.195651       0.24   0.230044
75%        0.256   0.251653       0.28   0.239414       0.28   0.254736
max         0.72   0.695105       0.68   0.636425        0.6   0.416937
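One way to read this table is to compare the spread of the predictions with the spread of the annotations. The snippet below computes the ratio of the standard deviations from the values in the table; interpreting a ratio below 1 as the model pulling its predictions toward the mean is an interpretation offered here, not a result stated by the notebook.

```python
# Standard deviations copied from the summary table above.
std_true = {'V': 0.138137, 'A': 0.098377, 'D': 0.084154}
std_pred = {'V': 0.107967, 'A': 0.068097, 'D': 0.042859}

ratios = {dim: std_pred[dim] / std_true[dim] for dim in ('V', 'A', 'D')}

for dim, ratio in ratios.items():
    print(f"{dim}: predicted/annotated std = {ratio:.2f}")
```

The ratio is highest for valence and lowest for dominance, the same ordering as the correlation coefficients reported for the three dimensions.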


Printing the most extreme sentences in the test set for each of the six columns, covering both the annotated and the predicted scores.

In [19]:
for i in ['V', 'V-p', 'A', 'A-p', 'D', 'D-p']:
    # idxmin/idxmax return index labels, which is what .loc expects
    print("Min {}:\n{}".format(i, df_test.loc[df_test[i].idxmin()]))
    print()
    print("Max {}:\n{}".format(i, df_test.loc[df_test[i].idxmax()]))
    print()
    print()

Min V:
Text    Bangladesh ferry sinks, 15 dead
V                                -0.372
A                                   0.2
D                                 0.028
V-p                           -0.119922
A-p                            0.279011
D-p                            0.114692
Name: 526, dtype: object

Max V:
Text    “That’s amazing!”
V                    0.72
A                    0.68
D                     0.4
V-p              0.695105
A-p              0.597201
D-p              0.319575
Name: 382, dtype: object


Min V-p:
Text    Bomb kills 18 on military bus in Iran
V                                       -0.08
A                                        0.28
D                                        0.08
V-p                                 -0.195293
A-p                                  0.364857
D-p                                  0.122096
Name: 461, dtype: object

Max V-p:
Text    “That’s amazing!”
V                    0.72
A                    0.68
D                     0.4
V