# Emotion Classification with BERT

This notebook is based in part on "Multi Class Text Classification With Deep Learning Using BERT" by Susan Li, which can be found here: https://towardsdatascience.com/multi-class-text-classification-with-deep-learning-using-bert-b59ca2f5c613.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
import os
import io

import pandas as pd
import torch
import transformers
# tqdm is imported once, in its notebook flavour; a plain `import tqdm` would be shadowed by this name
from tqdm.notebook import trange, tqdm

## Initializing the Tokenizer and Reading in the CSV

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
column_names = ['Emotion', 'Text']
train = pd.read_csv('./drive/My Drive/Teams_Lab/isear-train.csv', names=column_names, header=None, on_bad_lines='skip')
train = train.dropna()
# Raw-string regex matching bracketed non-responses; no capture group, so str.contains does not warn
pattern = r"\[.*?\]"
filtered = train['Text'].str.contains(pattern)
train = train[~filtered].copy()  # copy so the column assignment below does not write to a view
# Insert the SEP token after sentence-final punctuation that is followed by a space
for punct in ('. ', '! ', '? '):
    train['Text'] = train['Text'].astype(str).str.replace(punct, f'{punct.strip()} {tokenizer.sep_token} ', regex=False)


In [None]:
print("Dimensions are: ", train.shape)
train.head(25)

Dimensions are:  (5193, 2)


Unnamed: 0,Emotion,Text
0,joy,When I understood that I was admitted to the U...
1,fear,I broke a window of a neighbouring house and I...
2,joy,Got a big fish in fishing.
3,fear,"Whenever I am alone in a dark room, walk alone..."
4,shame,I bought a possible answer to a homework probl...
5,disgust,I read about a murderer who brutalized his vic...
6,joy,The day that my boyfriend appeared at home wit...
7,guilt,I went to a pub with a group of friends (not v...
8,anger,Had an insulting letter from my father.
10,fear,I was to be given an audition to get a role. [...
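
The two cleaning steps applied to the training text can be illustrated on plain strings. The sketch below is a minimal stand-in that uses only the standard library; the helper names are hypothetical, `SEP` is assumed to equal `tokenizer.sep_token` for `bert-base-uncased`, and the bracketed sample string is made up for illustration.

```python
import re

SEP = "[SEP]"  # assumed value of tokenizer.sep_token for bert-base-uncased


def is_non_response(text):
    """Return True when the text contains a bracketed placeholder."""
    return re.search(r"\[.*?\]", text) is not None


def mark_sentence_ends(text):
    """Insert SEP after '.', '!' or '?' when the mark is followed by a space."""
    return re.sub(r"([.!?]) ", rf"\1 {SEP} ", text)


print(is_non_response("[ No response.]"))             # hypothetical non-response
print(is_non_response("Got a big fish in fishing."))
print(mark_sentence_ends("I was late. I felt bad! Why? Unknown."))
```

The order matters: the bracket filter has to run before the SEP insertion, because afterwards every multi-sentence text contains `[SEP]` and would match the filter.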


## Reading test file

In [None]:
test = pd.read_csv('./drive/My Drive/Teams_Lab/isear-test.csv', names=column_names, header=None, on_bad_lines='skip')
test = test.dropna()
pattern = r"\[.*?\]"  # raw-string regex matching bracketed non-responses
filtered = test['Text'].str.contains(pattern)
test = test[~filtered].copy()  # copy so the column assignment below does not write to a view
# Insert the SEP token after sentence-final punctuation, as for the training set
for punct in ('. ', '! ', '? '):
    test['Text'] = test['Text'].astype(str).str.replace(punct, f'{punct.strip()} {tokenizer.sep_token} ', regex=False)






In [None]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Text Pre-processing: Expanding Contractions

In [None]:
import string
import contractions
txt = "I'd love to grab dinner sometime. I hadn't considered you'd go, but it's probably for the best that you're there."

def clean_text(txt):
    expanded_toks = []
    for word in txt.split():
        expanded_toks.append(contractions.fix(word))
    cleaned_text = ' '.join(expanded_toks)

    #cleaned_text = ' '.join(word.strip(string.punctuation) for word in cleaned_text.split())

    return cleaned_text

print(clean_text(txt))

I would love to grab dinner sometime. I had not considered you would go, but it is probably for the best that you are there.


In [None]:
train['Text'] = train['Text'].apply(clean_text)
test['Text'] = test['Text'].apply(clean_text)

In [None]:
train['len_text'] = train['Text'].astype(str).apply(lambda x: len(x.split()))

In [None]:
print(train['Text'][12])
# tokenize() does not accept add_special_tokens, so the keyword is omitted
print("Tokens: \n", tokenizer.tokenize(train['Text'][12]))
print("Token IDS: \n", tokenizer.convert_tokens_to_ids(tokenizer.tokenize(train['Text'][12])))


When I see that my mother forces my little brother (15 years) to work very hard for school. [SEP] I do not agree that she constantly puts him to work, but I do not tell it to her, so nothing changes.
Tokens: 
 ['when', 'i', 'see', 'that', 'my', 'mother', 'forces', 'my', 'little', 'brother', '(', '15', 'years', ')', 'to', 'work', 'very', 'hard', 'for', 'school', '.', '[SEP]', 'i', 'do', 'not', 'agree', 'that', 'she', 'constantly', 'puts', 'him', 'to', 'work', ',', 'but', 'i', 'do', 'not', 'tell', 'it', 'to', 'her', ',', 'so', 'nothing', 'changes', '.']
Token IDS: 
 [2043, 1045, 2156, 2008, 2026, 2388, 2749, 2026, 2210, 2567, 1006, 2321, 2086, 1007, 2000, 2147, 2200, 2524, 2005, 2082, 1012, 102, 1045, 2079, 2025, 5993, 2008, 2016, 7887, 8509, 2032, 2000, 2147, 1010, 2021, 1045, 2079, 2025, 2425, 2009, 2000, 2014, 1010, 2061, 2498, 3431, 1012]


In [None]:
# Collecting the valid labels: entries of 10 or more characters in the Emotion column are treated as malformed
emots_train = {i for i in set(train["Emotion"]) if len(i)<10}
emots_test = {i for i in set(test["Emotion"]) if len(i)<10}

print(emots_train)
print(emots_test)


{'joy', 'shame', 'guilt', 'anger', 'sadness', 'fear', 'disgust'}
{'shame', 'joy', 'guilt', 'anger', 'sadness', 'fear', 'disgust'}


In [None]:
train = train[train['Emotion'].isin(emots_train)]
test = test[test['Emotion'].isin(emots_test)]

## Tokenization

In [None]:
# Determining max length
max_len = 0
for text in train['Text']:
    input_ids = tokenizer.encode(text, add_special_tokens=True)
    max_len = max(max_len, len(input_ids))

In [None]:
# Obtaining input ids and attention masks for train
text = train['Text'].values
emos = train['Emotion'].values

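encoded = tokenizer.batch_encode_plus(
  text,
  add_special_tokens=True,
  padding="max_length",   # pad every sequence to max_len (replaces the deprecated pad_to_max_length)
  truncation=True,        # no effect here, since max_len was measured on the training set
  max_length=max_len,
  return_token_type_ids=True,
  return_attention_mask=True,
  return_tensors='pt'
)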

input_ids = encoded['input_ids']
attention_masks = encoded['attention_mask']
token_type_ids = encoded['token_type_ids']
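
What `padding="max_length"` and `return_attention_mask=True` produce for a single sequence can be sketched in plain Python. This is an illustration only, not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token ids


def pad_and_mask(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], pad to max_length, and build the attention mask."""
    body = list(token_ids)[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(ids)
    input_ids = ids + [PAD_ID] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# ids for 'when', 'i', 'see', taken from the token-id printout earlier in the notebook
input_ids, attention_mask = pad_and_mask([2043, 1045, 2156], max_length=8)
print(input_ids)
print(attention_mask)
```

The mask is what lets the model ignore the `[PAD]` positions during attention.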

In [None]:
# Obtaining input ids and attention masks for test
test_text = test['Text'].values
test_emotion = test['Emotion'].values

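tencoded = tokenizer.batch_encode_plus(
  test_text,
  add_special_tokens=True,
  padding='max_length',   # pad every sequence to max_len (replaces the deprecated pad_to_max_length)
  truncation=True,        # test texts may be longer than max_len, which was measured on train
  max_length=max_len,
  return_token_type_ids=True,
  return_attention_mask=True,
  return_tensors='pt'
)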

tinput_ids = tencoded['input_ids']
tattention_masks = tencoded['attention_mask']
ttoken_type_ids = tencoded['token_type_ids']

In [None]:
print(tokenizer.decode(input_ids[11]))

[CLS] when i see that my mother forces my little brother ( 15 years ) to work very hard for school. [SEP] i do not agree that she constantly puts him to work, but i do not tell it to her, so nothing changes. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

## Label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# Fit on the training labels only, then reuse the same mapping for the test labels
train['encoded_lab'] = labelencoder.fit_transform(train['Emotion'])
test['encoded_lab'] = labelencoder.transform(test['Emotion'])

In [None]:
label_dict = {l: i for i, l in enumerate(labelencoder.classes_)}
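
`LabelEncoder` assigns integer ids to the sorted unique labels, which is why `joy` maps to 4 and `fear` to 2 in the notebook's output. The sketch below reproduces that mapping in plain Python; `fit_label_mapping` is a hypothetical helper, not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its index in sorted order, as LabelEncoder does."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


emotions = ['joy', 'fear', 'shame', 'disgust', 'guilt', 'anger', 'sadness']
mapping = fit_label_mapping(emotions)
print(mapping)

# Fitting on a subset of the labels shifts the indices
print(fit_label_mapping(['joy', 'fear', 'shame']))
```

Because the ids depend on which labels are present at fit time, an encoder fitted separately on two splits only agrees when both splits contain exactly the same label set.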

In [None]:
print(train.head(10))
print(test.head(10))


    Emotion                                               Text  len_text  \
0       joy  When I understood that I was admitted to the U...        10   
1      fear  I broke a window of a neighbouring house and I...        21   
2       joy                         Got a big fish in fishing.         6   
3      fear  Whenever I am alone in a dark room, walk alone...        46   
4     shame  I bought a possible answer to a homework probl...        26   
5   disgust  I read about a murderer who brutalized his vic...        19   
6       joy  The day that my boyfriend appeared at home wit...        16   
7     guilt  I went to a pub with a group of friends (not v...        45   
8     anger            Had an insulting letter from my father.         7   
10     fear  I was to be given an audition to get a role. [...        26   

    encoded_lab  
0             4  
1             2  
2             4  
3             2  
4             6  
5             1  
6             4  
7             3  
8

In [None]:
labels_train = torch.tensor(train['encoded_lab'].values)
labels_test = torch.tensor(test['encoded_lab'].values)

In [None]:
from torch.utils.data import TensorDataset

## Building the datasets

In [None]:
dataset_train = TensorDataset(input_ids, attention_masks, labels_train)

In [None]:
dataset_test = TensorDataset(tinput_ids, tattention_masks, labels_test)

## Initializing BERT model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
     num_labels=len(label_dict),  # 7 emotion classes
     output_attentions=False,
     output_hidden_states=False)

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [None]:
# Setting batch size and dataloaders

batch_size = 64
dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)


dataloader_val = DataLoader(
    dataset_test,
    sampler=SequentialSampler(dataset_test),
    batch_size=batch_size 
)

In [None]:
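from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [None]:
# Using the PyTorch AdamW optimizer (the AdamW class in transformers is deprecated)
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    eps=1e-8,
    weight_decay=0.0  # matches the default of the deprecated transformers AdamW
)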



In [None]:
# Setting number of epochs and setting the scheduler
epochs = 6

scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(dataloader_train)*epochs
)
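
The factor that a linear schedule with warmup applies to the base learning rate can be written out directly. The function below is a plain-Python sketch of that rule rather than the library code; the name `linear_schedule_multiplier` is hypothetical, and the 82 batches per epoch are read from the training progress bars in this notebook.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate factor: linear ramp-up during warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 82 * 6  # batches per epoch times epochs

for step in (0, 246, 492):
    factor = linear_schedule_multiplier(step, 0, total_steps)
    print(step, base_lr * factor)
```

With `num_warmup_steps=0` there is no ramp-up: the rate starts at `1e-5`, is halved by the midpoint of training, and reaches zero on the last step.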

In [None]:
import numpy as np
from sklearn.metrics import f1_score

## Evaluation metrics

In [None]:
def f1_score_func(preds, labels):
    # preds holds one row of logits per example; argmax picks the predicted class
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
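
A support-weighted F1 score averages the per-class F1 values, weighting each class by its number of true examples. The sketch below computes it by hand on made-up logits, using only the standard library; `argmax_rows` and `weighted_f1` are hypothetical helpers intended to mirror `average='weighted'`, not scikit-learn's own code.

```python
def argmax_rows(logits):
    """Index of the largest value in each row of logits."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # tp + fn is the class support
    return score


# Toy logits for four examples and three classes (illustrative values only)
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4], [0.1, 0.3, 2.2]]
y_true = [0, 1, 2, 2]
y_pred = argmax_rows(logits)
print(y_pred)
print(weighted_f1(y_true, y_pred))
```

Here the per-class F1 values are 2/3, 1 and 2/3 with supports 1, 1 and 2, so the weighted score comes to about 0.75.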

In [None]:
def accuracy_per_class(preds, labels):
    # Invert label_dict so that integer ids map back to emotion names
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_pred = preds_flat[labels_flat == label]
        y_true = labels_flat[labels_flat == label]
        print(f'Class:{label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_pred[y_pred==label])}/{len(y_true)}\n')

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []
    for batch in tqdm(dataloader_val):
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals


## Training Loop

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs ={
            'input_ids'    :batch[0],
            'attention_mask':batch[1],
            'labels'        :batch[2]
        }
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
    
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        progress_bar.set_postfix(
            {'training_loss': '{:.3f}'.format(loss.item())})  # the loss is already the batch mean
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')

Epoch 1
Training loss: 0.5478803082937147
Validation loss: 0.9331372479597727
F1 Score (weighted): 0.6976626232994025

Epoch 2
Training loss: 0.4386812739255952
Validation loss: 0.9556000729401907
F1 Score (weighted): 0.6954860323360449

Epoch 3
Training loss: 0.35970777781998237
Validation loss: 0.9954851236608293
F1 Score (weighted): 0.7004788178984461

Epoch 4
Training loss: 0.30917848773845813
Validation loss: 1.0155504246552784
F1 Score (weighted): 0.7017857242911341

Epoch 5
Training loss: 0.2879343203655103
Validation loss: 1.0370522903071508
F1 Score (weighted): 0.6978983270061822

Epoch 6
Training loss: 0.2705233032565291
Validation loss: 1.0311903953552246
F1 Score (weighted): 0.6981580720162264
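
The logged metrics can be used to pick which epoch's weights to keep. The sketch below copies the validation numbers from the run above into lists and finds the epoch with the lowest validation loss and the one with the highest weighted F1; it is one possible selection rule, and the notebook itself simply keeps the final epoch.

```python
# Validation metrics copied from the training log, one entry per epoch
val_loss = [0.9331372479597727, 0.9556000729401907, 0.9954851236608293,
            1.0155504246552784, 1.0370522903071508, 1.0311903953552246]
val_f1 = [0.6976626232994025, 0.6954860323360449, 0.7004788178984461,
          0.7017857242911341, 0.6978983270061822, 0.6981580720162264]

# Epochs are numbered from 1, list indices from 0
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_f1_epoch = max(range(len(val_f1)), key=val_f1.__getitem__) + 1

print("Lowest validation loss at epoch", best_loss_epoch)
print("Highest weighted F1 at epoch", best_f1_epoch)
```

Training loss falls every epoch while validation loss mostly rises and F1 stays near 0.70, so later epochs may be fitting the training set more closely without improving generalization.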


## Saving the model

In [None]:
output_dir = "./bert-base-uncased/"

# Create the directory if needed; no error if it already exists
os.makedirs(output_dir, exist_ok=True)

print(f"Saving model to {output_dir}")

model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Saving model to ./bert-base-uncased/


('./bert-base-uncased/tokenizer_config.json',
 './bert-base-uncased/special_tokens_map.json',
 './bert-base-uncased/vocab.txt',
 './bert-base-uncased/added_tokens.json')

In [None]:
target_dir = f"\"./drive/My Drive/Teams_Lab\""
!cp -r $output_dir $target_dir

## Loading the Saved Model and Making Predictions

In [None]:
# Uncomment to reload the saved weights from Drive; as written, the in-memory model is evaluated.
# model = BertForSequenceClassification.from_pretrained("./drive/My Drive/Teams_Lab/bert-base-uncased/",
#                                                       num_labels=len(label_dict),
#                                                       output_attentions=False,
#                                                       output_hidden_states=False)

model.to(device)

_, predictions, true_vals = evaluate(dataloader_val)
accuracy_per_class(predictions, true_vals)


Class:anger
Accuracy:109/171

Class:disgust
Accuracy:103/169

Class:fear
Accuracy:130/164

Class:guilt
Accuracy:100/148

Class:joy
Accuracy:138/160

Class:sadness
Accuracy:104/149

Class:shame
Accuracy:96/156
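
The counts printed by `accuracy_per_class` are easier to compare as fractions. Each ratio is the share of a class's examples that were predicted correctly, that is, the per-class recall. The sketch below copies the counts from the output above and derives the per-class fractions and the overall accuracy.

```python
# (correct, total) per class, copied from the evaluation output
counts = {
    'anger': (109, 171),
    'disgust': (103, 169),
    'fear': (130, 164),
    'guilt': (100, 148),
    'joy': (138, 160),
    'sadness': (104, 149),
    'shame': (96, 156),
}

per_class = {name: correct / total for name, (correct, total) in counts.items()}
for name, fraction in sorted(per_class.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {fraction:.3f}")

total_correct = sum(correct for correct, _ in counts.values())
total_examples = sum(total for _, total in counts.values())
print(f"overall  {total_correct / total_examples:.3f}")
```

On these numbers `joy` and `fear` are recognized most reliably and `disgust` and `shame` least, with an overall accuracy of 780/1117, in line with the weighted F1 of about 0.70.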

