## Emotion Classification using Fine-tuned BERT model

In this tutorial, I will show how to fine-tune a language model (LM) for emotion classification, with code adapted from this [tutorial](https://zablo.net/blog/post/custom-classifier-on-bert-model-guide-polemo2-sentiment-analysis/) by Marcin Zabłocki. I modified his code to suit the emotion classification task and to use a different BERT model. Please refer to his tutorial for more detailed explanations of each code block. I really liked his tutorial because of its attention to detail and its use of high-level libraries to handle parts of the workflow such as training and finding a good learning rate.

Before you get started, make sure to enable `GPU` in the runtime and be sure to
restart the runtime in this environment after installing the `pytorch-lr-finder` library.

This tutorial is a rough draft, so if you find any issues or have further questions, reach out to me via [Twitter](https://twitter.com/omarsar0).

Note that the notebook was created a while back, so if something breaks, it is likely because the code is no longer compatible with newer versions of the libraries.


In [1]:
# %%capture
# !pip install transformers tokenizers pytorch-lightning torch-lr-finder

Note: you need to restart the runtime after running this code segment.

In [2]:
# %%capture
# !git clone https://github.com/davidtvs/pytorch-lr-finder.git && cd pytorch-lr-finder && python setup.py install

In [None]:
import logging
import os
from argparse import Namespace
from functools import lru_cache
from typing import List

import pandas as pd
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from sklearn.metrics import classification_report
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from torch import nn
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of this one
from torch.utils.data import DataLoader, Dataset
from transformers import (
    AutoModelWithLMHead,
    AutoTokenizer,
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    get_linear_schedule_with_warmup,
)

torch.__version__

## Load the Pretrained Language Model
We are first going to look at the pretrained language models provided by HuggingFace. We will use DistilRoBERTa base, a distilled version of RoBERTa, which is itself a variant of BERT. The `base` model has fewer parameters than the `large` model.

[RoBERTa](https://arxiv.org/abs/1907.11692) is a variant of BERT which "*modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates*".

Knowledge distillation helps train smaller LMs that retain much of the performance and potential of the larger model.

First, let's load the tokenizer for this model:

In [4]:
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

Now let's load the actual model with the LM head, which takes care of the predictions for language modeling. When fine-tuning, we don't use the head and instead use the base model. The code below shows how to do this:

In [None]:
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")
base_model = model.base_model

Let's try out the tokenizer first:

In [6]:
text = "Elvis is the king of rock!"
enc = tokenizer.encode_plus(text)
enc.keys()

dict_keys(['input_ids', 'attention_mask'])

In [7]:
print(enc)

{'input_ids': [0, 9682, 9578, 16, 5, 8453, 9, 3152, 328, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


`input_ids` are the numerical encodings of the tokens, i.e., their indices in the vocabulary. `attention_mask` is an additional, optional input used when batching sequences together to tell the model which tokens should be attended to ([read more](https://huggingface.co/transformers/glossary.html#attention-mask)). Sequences in a batch vary in length, so shorter ones are padded; the attention mask tells the model not to attend to the padded positions.
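
To make the role of the mask concrete, here is a minimal plain-Python sketch of padding a batch by hand. It assumes that the RoBERTa padding id is 1, and `pad_batch` is a hypothetical helper name used only for illustration; in practice the tokenizer does all of this for you.

```python
PAD_ID = 1  # assumption: id of <pad> in the RoBERTa vocabulary


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [0, 9682, 9578, 16, 5, 8453, 9, 3152, 328, 2],  # the encoded example sentence
    [0, 9682, 2],                                    # a shorter sequence (ids reused for illustration)
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The shorter sequence is extended with padding ids, and its mask is 1 for the three real tokens and 0 everywhere else.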

We only use `input_ids` and `attention_mask`, and we call `unsqueeze(0)` on each to add a batch dimension, which simulates batch processing. For reference, see the documentation for [DistilBertForSequenceClassification](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification).

In [8]:
out = base_model(torch.tensor(enc["input_ids"]).unsqueeze(0), torch.tensor(enc["attention_mask"]).unsqueeze(0))
out[0].shape

torch.Size([1, 10, 768])

In [9]:
## size of representation of one of the tokens
out[0][:,0,:].shape

torch.Size([1, 768])

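`torch.Size([1, 10, 768])` represents the batch size, the number of tokens in the input text (the length of the tokenized text), and the model's output hidden size. Indexing with `[:, 0, :]` selects the representation of the first token (`<s>`), which gives `torch.Size([1, 768])`.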

In [10]:
t = "Elvis is the king of rock"
enc = tokenizer.encode_plus(t)
token_representations = base_model(torch.tensor(enc["input_ids"]).unsqueeze(0))[0][0]
print(enc["input_ids"])
print(tokenizer.decode(enc["input_ids"]))
print(f"Length: {len(enc['input_ids'])}")
print(token_representations.shape)

[0, 9682, 9578, 16, 5, 8453, 9, 3152, 2]
<s>Elvis is the king of rock</s>
Length: 9
torch.Size([9, 768])


## Building Custom Classification head on top of LM base model

Use the Mish activation function, as proposed in the original tutorial.
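
Mish is defined as `x * tanh(softplus(x))`, where `softplus(x) = log(1 + exp(x))`. As a quick way to build intuition, the plain-Python sketch below evaluates the formula on a few scalars. The helper names `softplus` and `mish_scalar` are hypothetical and exist only for this illustration.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish_scalar(x):
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-3.0, 0.0, 3.0):
    print(f"mish({x:+.1f}) = {mish_scalar(x):+.4f}")
```

For large positive inputs Mish behaves almost like the identity, while negative inputs are squashed toward a small negative value rather than being cut to exactly zero as with ReLU.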

In [11]:
# from https://github.com/digantamisra98/Mish/blob/b5f006660ac0b4c46e2c6958ad0301d7f9c59651/Mish/Torch/mish.py
@torch.jit.script
def mish(input):
    return input * torch.tanh(F.softplus(input))

class Mish(nn.Module):
    def forward(self, input):
        return mish(input)

The model we will use for fine-tuning:

In [12]:
class EmoModel(nn.Module):
    def __init__(self, base_model, n_classes, base_model_output_size=768, dropout=0.05):
        super().__init__()
        self.base_model = base_model

        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(base_model_output_size, base_model_output_size),
            Mish(),
            nn.Dropout(dropout),
            nn.Linear(base_model_output_size, n_classes)
        )

        for layer in self.classifier:
            if isinstance(layer, nn.Linear):
                layer.weight.data.normal_(mean=0.0, std=0.02)
                if layer.bias is not None:
                    layer.bias.data.zero_()

    def forward(self, input_, *args):
        X, attention_mask = input_
        hidden_states = self.base_model(X, attention_mask=attention_mask)

        # maybe do some pooling / RNNs... go crazy here!

        # use the <s> representation
        return self.classifier(hidden_states[0][:, 0, :])
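
The comment in `forward` hints that other ways of summarizing the token representations are possible. One common alternative to taking the `<s>` vector is masked mean pooling, which averages the vectors of all real tokens and ignores padding. The sketch below shows the idea with nested Python lists; `masked_mean_pool` is a hypothetical name, and this is not what the model in this tutorial does.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    hidden_size = len(token_vectors[0])
    totals = [0.0] * hidden_size
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # max(..., 1) guards against a sequence that is entirely padding
    return [total / max(n_real, 1) for total in totals]


# Toy "hidden states": 4 tokens with hidden size 2; the last token is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position does not influence the result, which is why the attention mask has to be part of the pooling step.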

### Pretest the model with dummy text
We want to ensure that the model is returning the right information.

In [13]:
classifier = EmoModel(AutoModelWithLMHead.from_pretrained("distilroberta-base").base_model, 3)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
X = torch.tensor(enc["input_ids"]).unsqueeze(0).to('cpu')
attn = torch.tensor(enc["attention_mask"]).unsqueeze(0).to('cpu')

In [15]:
classifier((X, attn))

tensor([[-0.1118, -0.1068,  0.0190]], grad_fn=<AddmmBackward0>)
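
The three numbers above are raw scores (logits), one per class. As a rough illustration of how they are interpreted, the plain-Python sketch below applies softmax to turn them into probabilities and computes the cross-entropy for a chosen target class. The function names are hypothetical; during training PyTorch performs the equivalent computation on tensors.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class (log-softmax followed by NLL)."""
    return -math.log(softmax(logits)[target])


logits = [-0.1118, -0.1068, 0.0190]  # the output printed above
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because the classification head is still untrained, the probabilities are close to uniform; the loss is lowest for whichever class currently has the largest logit.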

## Prepare your dataset for fine-tuning

In [None]:
!mkdir -p tokenizer

In [17]:
## save the pretrained tokenizer files (vocab.json, merges.txt, ...) to disk
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.json',
 'tokenizer/merges.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [None]:
!ls tokenizer

Implement a collate function (`TokenizersCollateFn`) using fast tokenizers.
This function takes care of tokenizing the texts and assembling them into padded batches, so you don't need to create your batches manually. Find out more about Tokenizers [here](https://github.com/huggingface/tokenizers/tree/master/bindings/python).
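
Conceptually, a collate function of this kind truncates each example, wraps it in the special start and end tokens, pads it to a fixed length, and returns the inputs together with the labels. The sketch below walks through those steps on already-tokenized ids in plain Python. The special-token ids (0, 1, 2) are assumptions about the RoBERTa vocabulary, and `collate` with its `"ids"` field is a hypothetical stand-in rather than the class used in this notebook.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumption: ids of <s>, <pad>, </s> in RoBERTa


def collate(batch, max_tokens=8):
    """Turn a list of {"ids", "label"} examples into fixed-length model inputs."""
    input_ids, attention_masks, labels = [], [], []
    for example in batch:
        # Reserve two positions for the <s> and </s> special tokens
        body = example["ids"][: max_tokens - 2]
        ids = [BOS_ID] + body + [EOS_ID]
        n_pad = max_tokens - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_masks.append([1] * len(ids) + [0] * n_pad)
        labels.append(example["label"])
    return (input_ids, attention_masks), labels


# Hypothetical, already-tokenized examples (ids without special tokens)
batch = [
    {"ids": [9682, 9578, 16, 5, 8453, 9, 3152, 328], "label": 1},
    {"ids": [9682, 9578], "label": 0},
]
(input_ids, attention_masks), labels = collate(batch)
print(input_ids)
print(attention_masks)
```

The first example is truncated to fit, the second is padded, and both end up with exactly `max_tokens` positions so they can be stacked into one tensor.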

In [None]:
class TokenizersCollateFn:
    def __init__(self, max_tokens=512):

        ## RoBERTa uses BPE tokenizer similar to GPT
        t = ByteLevelBPETokenizer(
            "tokenizer/vocab.json",
            "tokenizer/merges.txt"
        )
        t._tokenizer.post_processor = BertProcessing(
            ("</s>", t.token_to_id("</s>")),
            ("<s>", t.token_to_id("<s>")),
        )
        t.enable_truncation(max_tokens)
        t.enable_padding(length=max_tokens, pad_id=t.token_to_id("<pad>"))
        self.tokenizer = t

    def __call__(self, batch):
        encoded = self.tokenizer.encode_batch([x['text'] for x in batch])
        sequences_padded = torch.tensor([enc.ids for enc in encoded])
        attention_masks_padded = torch.tensor([enc.attention_mask for enc in encoded])
        labels = torch.tensor([x['label'] for x in batch])

        return (sequences_padded, attention_masks_padded), labels

## Getting the Data and Previewing It
Below we are going to load the data and preview it. We don't need to split the data manually because I have already created the splits and stored those files separately; you can quickly download them below:

In [20]:
from datasets import load_dataset

In [None]:
ds = load_dataset('emotion')
ds['validation'][19]

In [22]:
## emotion labels
label2int = {
  "sadness": 0,
  "joy": 1,
  "love": 2,
  "anger": 3,
  "fear": 4,
  "surprise": 5
}

emotions = [ "sadness", "joy", "love", "anger", "fear", "surprise"]

## Create the Dataset object

Create the Dataset object that will be used to load the different datasets.

In [23]:
class EmoDataset(Dataset):
    def __init__(self, path):
        super().__init__()
        self.data_column = "text"
        self.class_column = "class"
        self.data = pd.read_csv(path, sep=";", header=None, names=[self.data_column, self.class_column],
                               engine="python")

    def __getitem__(self, idx):
        return self.data.loc[idx, self.data_column], label2int[self.data.loc[idx, self.class_column]]

    def __len__(self):
        return self.data.shape[0]
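
`EmoDataset` expects files with one `text;label` pair per line and no header row. The standard-library sketch below parses two such lines and maps the labels to integers, which is roughly what `__getitem__` returns for each row. The sample sentences are made up for illustration.

```python
import csv
import io

label2int = {"sadness": 0, "joy": 1, "love": 2, "anger": 3, "fear": 4, "surprise": 5}

# Hypothetical file contents: one "text;label" pair per line, no header row
raw = "i feel great today;joy\ni am so scared of the dark;fear\n"

rows = []
for text, label in csv.reader(io.StringIO(raw), delimiter=";"):
    rows.append((text, label2int[label]))

print(rows)
```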

## Training with PyTorchLightning

[PyTorchLightning](https://www.pytorchlightning.ai/) is a library built on top of PyTorch that abstracts away much of the boilerplate involved in training neural networks.

![](https://pytorch-lightning.readthedocs.io/en/latest/_images/pt_to_pl.png)

In [24]:
from torch_lr_finder import LRFinder

In [25]:
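# These are the names of the splits in `ds`, not file paths
train_path = "train"
test_path = "test"
val_path = "validation"
val_path = "test"  # overrides the line above: validation metrics are computed on the test split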

In [26]:
import torch
# from codes import DictEnum, auto


# class Optimizer(DictEnum):
#     ADAM = auto()
#     KATE = auto()


class ADAM(torch.optim.Optimizer):
    def __init__(self, params, lr):  # lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr)
        super().__init__(params, defaults)


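    def step(self, closure=None):
        # Lightning (and the standard optimizer API) may pass a closure that
        # re-evaluates the model and returns the loss.
        loss = closure() if closure is not None else None

        for group in self.param_groups:

            for p in group['params']:
                if p.grad is None:  # skip parameters that received no gradient
                    continue
                grad = p.grad.data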
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Momentum (Exponential MA of gradients)
                    state['m'] = torch.zeros_like(p.data)

                    # RMSProp component (exponential MA of squared gradients). Denominator.
                    state['v'] = torch.zeros_like(p.data)

                m, v = state['m'], state['v']

                beta1, beta2 = 0.9, 0.999
                state['step'] += 1

                # Add weight decay if any
                # if group['weight_decay'] != 0:
                #     grad = grad + group['weight_decay']*p.data

                # Momentum
                m = torch.mul(m, beta1) + (1 - beta1)*grad

                # RMS
                v = torch.mul(v, beta2) + (1-beta2)*(grad*grad)

                mhat = m / (1 - beta1 ** state['step'])
                vhat = v / (1 - beta2 ** state['step'])

                eps = 1e-4
                denom = torch.sqrt(vhat) + eps

                lr = group['lr']
                p.data = p.data - lr * mhat / denom

                # Save state
                state['m'], state['v'] = m, v

        return loss




class KATE(torch.optim.Optimizer): # delta 0 or 1e-8
    def __init__(self, params, lr):  # lr=1e-3, eta=0.9, eps=1e-8, delta=0, weight_decay=0):
        defaults = dict(lr=lr)
        super().__init__(params, defaults)

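    def step(self, closure=None):
        # Accept an optional closure, as the standard optimizer API does
        loss = closure() if closure is not None else None
        for group in self.param_groups:

            for p in group['params']:
                if p.grad is None:  # skip parameters that received no gradient
                    continue
                grad = p.grad.data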
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p.data)
                    state['b'] = torch.zeros_like(p.data)

                m, b = state['m'], state['b']
                eta = 0.001

                # if group['weight_decay'] != 0:
                #     grad = grad + group['weight_decay']*p.data


                g = grad*grad

                b = b + g
                eps = 1e-5
                denom = b + eps
                m = m + eta * g + g / denom

                lr = group['lr']
                p.data = p.data - lr * torch.sqrt(m) * grad / denom

                # Save state
                state['m'], state['b'] = m, b
                state['step'] += 1

        return loss
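
To see what the `ADAM` class above does to a single parameter, the sketch below runs one update on a scalar in plain Python. It uses the same constants as the class (`beta1=0.9`, `beta2=0.999`, `eps=1e-4`); `adam_step` is a hypothetical helper, and the starting value and gradient are made up.

```python
import math


def adam_step(p, grad, m, v, step, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-4):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** step)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=0.5, m=m, v=v, step=1)
print(p)
```

On the first step the bias correction makes `m_hat` equal to the gradient and `sqrt(v_hat)` equal to its magnitude, so the parameter moves by roughly `lr` regardless of how large the gradient is.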

In [27]:
import os
from mlflow import MlflowClient
from mlflow.entities import ViewType


class MLFlowLogger():
    def __init__(self):
        tracking_uri = os.path.expanduser('~/mlruns/')
        experiment_name = os.environ['MLFLOW_EXPERIMENT_NAME']
        self.c = MlflowClient(tracking_uri=tracking_uri)
        self.e = self.c.get_experiment_by_name(experiment_name)
        self.e_id = self.c.create_experiment(experiment_name)\
            if self.e is None else self.e.experiment_id

        self.enabled = True

    def check_exist(self, config):
        check_exist = eval(os.environ['MLFLOW_CHECK_EXIST'])
        if not check_exist:
            return False

        if self.e is not None:
            filter_string = list()
            for key in config.keys():
                if isinstance(config[key], dict):
                    value = config[key]['name']
                else:
                    value = config[key]
                filter_string.append(f'params.{key}="{value}"')
            filter_string.append('attributes.status="FINISHED"')
            tags = eval(os.environ['MLFLOW_RUN_TAGS'])
            for key, value in tags.items():
                filter_string.append(f'tags."{key}" = "{value}"')
            # print(f"{filter_string=}")
            filter_string = ' and '.join(filter_string)
            runs = self.c.search_runs(experiment_ids=[self.e.experiment_id],
                                      filter_string=filter_string,
                                      run_view_type=ViewType.ACTIVE_ONLY)
            if len(runs):
                return True
        return False

    def init(self, config):
        if not self.enabled:
            return

        r = self.c.create_run(experiment_id=self.e_id,
                              run_name=os.environ['MLFLOW_RUN_NAME'],
                              tags=eval(os.environ['MLFLOW_RUN_TAGS']))
        self.r_id = r.info.run_id
        self.c.log_dict(self.r_id, config, 'config.json')
        for key, value in config.items():
            if isinstance(value, dict):
                value = value['name']
            self.c.log_param(self.r_id, key, value)

    def log_metrics(self, metrics, step):
        if not self.enabled:
            return

        for key in metrics.keys():
            self.c.log_metric(self.r_id, key, metrics[key], step=step)
        # print('Step ' + '{}: '.format(step).rjust(6) +
              # ' '.join("{}: {:.5f}".format(k, v) for k, v in metrics.items() if 'weights' not in k), end='\r', flush=True)

    def terminate(self):
        if not self.enabled:
            return

        self.c.set_terminated(self.r_id)
        print()

os.environ['MLFLOW_VERBOSE'] = 'True'
# os.environ['MLFLOW_CHECK_EXIST'] = 'False'
os.environ['MLFLOW_CHECK_EXIST'] = 'True'
os.environ['MLFLOW_EXPERIMENT_NAME'] = os.path.basename(os.getcwd())

In [28]:
## Methods required by PyTorchLightning

class TrainingModule(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.model = EmoModel(AutoModelWithLMHead.from_pretrained("distilroberta-base").base_model, len(emotions))
        self.loss = nn.CrossEntropyLoss() ## combines LogSoftmax() and NLLLoss()
        #self.hparams = hparams
        self.hparams.update(vars(hparams))
        self.test_step_y_hats = []
        self.test_step_ys = []

        config = vars(hparams)
        self.loggger = MLFlowLogger()
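        if self.loggger.check_exist(config):
            # A finished run with this config already exists: disable MLflow logging
            # so later calls do not reference a run that was never created.
            self.loggger.enabled = False
            return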
        self.loggger.enabled = eval(os.environ['MLFLOW_VERBOSE'])
        self.loggger.init(config)
        self.loggger.step = 0


    def step(self, batch, step_name="train"):
        X, y = batch
        pred = self.forward(X)
        loss = self.loss(pred, y)
        # loss_key = f"{step_name}_loss"
        # tensorboard_logs = {loss_key: loss}

        self.log(f"{step_name}_loss", loss)  # e.g. "train_loss" or "test_loss"
        return loss

        # return { ("loss" if step_name == "train" else loss_key): loss, 'log': tensorboard_logs,
               # "progress_bar": {loss_key: loss}}
        # return pred, y

    def forward(self, X, *args):
        return self.model(X, *args)

    def training_step(self, batch, batch_idx):
        return self.step(batch, "train")

    # def validation_step(self, batch, batch_idx):
    #     # return self.step(batch, "val")
    #     X, y = batch
    #     pred = self.forward(X)
    #     loss = self.loss(pred, y)
        # self.log("val_loss", val_loss)

    # def validation_end(self, outputs: List[dict]):
    #     loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    #     # return {"val_loss": loss}

    # def test_step(self, batch, batch_idx):
    #     x, y = batch
    #     logits = self(x)
    #     # self.test_acc(logits, y)
    #     # self.log('test_acc', self.test_acc, on_step=False, on_epoch=True)
    #     preds = logits.argmax(dim=-1)
    #     acc = (y == preds).float().mean()
    #     return acc
    #     # print(acc)
    def test_step(self, batch, batch_idx):
        return self.step(batch, "test")

    def validation_step(self, batch, batch_idx):
        X, y = batch
        logits = self.forward(X)
        loss = self.loss(logits, y)
#         loss_key = f"{step_name}_loss"
#         tensorboard_logs = {loss_key: loss}

#         return { ("loss" if step_name == "train" else loss_key): loss, 'log': tensorboard_logs,
#                "progress_bar": {loss_key: loss}}
        # return pred, y
        self.log("val_loss", loss)
        self.test_step_y_hats.append(logits)
        self.test_step_ys.append(y)

    def on_validation_epoch_end(self):
        # Accuracy over every example seen in this validation epoch.
        # Concatenating first avoids over-weighting a smaller final batch,
        # which happens when per-batch accuracies are averaged.
        preds = torch.cat(self.test_step_y_hats).argmax(dim=-1)
        targets = torch.cat(self.test_step_ys)
        acc = (preds == targets).float().mean().item()
        # print('acc', acc)
        # log = {}
        # log.update({'Accuracy': acc})
        # self.log_dict(log)
        self.test_step_y_hats = []
        self.test_step_ys = []

        if self.loggger.enabled:
            self.loggger.log_metrics({'accuracy': acc}, self.loggger.step)
            self.loggger.step += 1

    def train_dataloader(self):
        return self.create_data_loader(self.hparams.train_path, shuffle=True)

    def val_dataloader(self):
        return self.create_data_loader(self.hparams.val_path)

    def test_dataloader(self):
        return self.create_data_loader(self.hparams.test_path)

    def create_data_loader(self, ds_path: str, shuffle=False):
        return DataLoader(
                    ds[ds_path],
                    batch_size=self.hparams.batch_size,
                    shuffle=shuffle,
                    collate_fn=TokenizersCollateFn()
        )

    @lru_cache()
    def total_steps(self):
        return len(self.train_dataloader()) // self.hparams.accumulate_grad_batches * self.hparams.epochs

    def configure_optimizers(self):
        ## The custom ADAM optimizer defined above is used here; swap in one of
        ## the commented lines to try AdamW or KATE instead.
        ## On AdamW, read: https://www.fast.ai/2018/07/02/adam-weight-decay/
        # optimizer = AdamW(self.model.parameters(), lr=self.hparams.lr)
        # optimizer = KATE(self.model.parameters(), lr=self.hparams.lr)
        optimizer = ADAM(self.model.parameters(), lr=self.hparams.lr)
        lr_scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=self.hparams.warmup_steps,
                    num_training_steps=self.total_steps(),
        )
        return [optimizer], [{"scheduler": lr_scheduler, "interval": "step"}]
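
The scheduler configured in `configure_optimizers` ramps the learning rate up linearly during the warmup steps and then decays it linearly to zero over the remaining steps. The plain-Python sketch below mirrors that shape so you can see the multiplier at a few points. `linear_warmup_factor` is a hypothetical helper, and the number of batches per epoch is an assumption chosen for illustration.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay to 0


# Hypothetical run: 250 batches per epoch, no gradient accumulation, 10 epochs
total_steps = 250 // 1 * 10
base_lr = 1e-4
for step in (0, 50, 100, 1300, 2500):
    factor = linear_warmup_factor(step, warmup_steps=100, total_steps=total_steps)
    print(step, base_lr * factor)
```

The total step count is computed the same way as in `total_steps()`: batches per epoch, divided by the accumulation factor, times the number of epochs.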

## Finding Learning rate for the model

The code below runs a learning-rate range test: a short trial run in which the learning rate is increased linearly or exponentially between a lower and an upper bound while the loss is recorded. The resulting curve helps pick a good learning rate for the actual training run.

More: https://github.com/davidtvs/pytorch-lr-finder
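
The idea behind a range test can be sketched without any library: generate learning rates spaced evenly on a log scale, record the loss for each, and look for the region where the loss falls fastest. The code below is a simplified illustration of that idea, not the algorithm implemented by `pytorch-lr-finder`. The function names are hypothetical and the loss values are invented.

```python
def exponential_lr_sweep(start_lr, end_lr, num_iter):
    """Learning rates spaced evenly on a log scale (requires num_iter >= 2)."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_iter - 1)) for i in range(num_iter)]


def steepest_drop_lr(lrs, losses):
    """Return the learning rate at the start of the largest loss decrease."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    return lrs[steepest]


lrs = exponential_lr_sweep(1e-6, 1.0, 7)
toy_losses = [2.0, 1.98, 1.9, 1.5, 1.2, 1.8, 5.0]  # invented values for illustration
suggested = steepest_drop_lr(lrs, toy_losses)
print(suggested)
```

In the toy curve the loss falls fastest before it starts to diverge at the largest learning rates, which is the typical shape you look for in the plot.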

In [29]:
# lr=0.1 ## upper bound LR
# from torch_lr_finder import LRFinder
# hparams_tmp = Namespace(
#     train_path=train_path,
#     val_path=val_path,
#     test_path=test_path,
#     batch_size=16,
#     warmup_steps=100,
#     epochs=1,
#     lr=lr,
#     accumulate_grad_batches=1,
# )
# module = TrainingModule(hparams_tmp)
# criterion = nn.CrossEntropyLoss()
# optimizer = AdamW(module.parameters(), lr=5e-7) ## lower bound LR
# # lr_finder = LRFinder(module, optimizer, criterion, device="cuda")
# # lr_finder = LRFinder(module, optimizer, criterion, device="cuda")
# lr_finder.range_test(module.train_dataloader(), end_lr=100, num_iter=100, accumulation_steps=hparams_tmp.accumulate_grad_batches)
# lr_finder.plot()
# lr_finder.reset()

In [30]:
lr = 1e-4  # the value from the original code, used for Adam
lr

# lr = 1e-5 #  stepsize for KATE

0.0001

In [31]:
# lr_finder.plot(show_lr=lr)

## Training the Emotion Classifier

In [None]:
hparams = Namespace(
    train_path=train_path,
    val_path=val_path,
    test_path=test_path,
    batch_size=64,
    warmup_steps=100,
    epochs=10,
    lr=lr,
    accumulate_grad_batches=1
)
os.environ['MLFLOW_RUN_TAGS'] = str(dict(about=f'tuned'))
os.environ['MLFLOW_RUN_NAME'] = 'Adam'
# os.environ['MLFLOW_RUN_NAME'] = 'KATE'
# os.environ['MLFLOW_RUN_NAME'] = 'adam my'
module = TrainingModule(hparams)

In [33]:
# Machine-specific: restrict training to GPU 3. Adjust or remove for your setup.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

In [None]:
torch.set_float32_matmul_precision('medium')
trainer = pl.Trainer(accelerator="auto", max_epochs=hparams.epochs,
                     accumulate_grad_batches=hparams.accumulate_grad_batches, precision='16-mixed', check_val_every_n_epoch=1)
trainer.fit(module)