# Fine-Tuning a Pre-trained Language Model (PLM) for Sentiment Analysis

<img src='https://media.geeksforgeeks.org/wp-content/uploads/20230802120409/Single-Sentence-Classification-Task.png'>

The IMDB dataset will be used for sentiment analysis: the model should classify each review as ***positive*** or ***negative***.




In [None]:
import torch
import pickle
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BartTokenizer, BartForSequenceClassification
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

from tabulate import tabulate
from tqdm import trange
import random

## Load the Dataset

The following cells load the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification from a CSV-formatted file:

In [None]:
df = pd.read_csv('movie_data.csv')
print(len(df))
df.head()

In [None]:
df.groupby(['sentiment']).size().plot.bar()

In [None]:
text = df.review.values
labels = df.sentiment.values
print(text[652])
print(labels[652])

*The data set must be divided into a training set (80%) and a validation set (20%) without shuffling.* The code below is adapted from the sample notebook at the following URL:

- https://colab.research.google.com/drive/1yxLJzohAIg5Xma5i7nTP4b_9ZtFp_YAy#scrollTo=D0FoX3-31HKg
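
For reference, a minimal sketch of what an 80/20 split *without shuffling* could look like, using plain slicing on placeholder data. The helper name `sequential_split` is hypothetical and is not part of the sample notebook. Note that the split cell further down uses a stratified `train_test_split`, which does shuffle the indices.

```python
def sequential_split(items, val_ratio=0.2):
    """Split items into (train, validation) while keeping the original order."""
    cut = len(items) - int(len(items) * val_ratio)
    return items[:cut], items[cut:]

# Placeholder data standing in for the review texts
reviews = [f"review {i}" for i in range(10)]
train, val = sequential_split(reviews)

print(len(train), len(val))
print(val)
```

The last 20% of the rows becomes the validation set, so this only gives balanced classes if the file itself is not sorted by label.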


## Download BartTokenizer

In [None]:
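# BART uses a case-sensitive byte-level BPE vocabulary; the BERT-style
# do_lower_case option is not used by this tokenizer, so it is omitted.
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')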

The vocabulary contains 50,265 tokens:

In [None]:
tokenizer.vocab_size

Tokens and their token IDs:

In [None]:
def print_rand_sentence():
  '''Displays the tokens and respective IDs of a random text sample'''
  index = random.randint(0, len(text)-1)
  table = np.array([tokenizer.tokenize(text[index]),
                    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text[index]))]).T
  print(tabulate(table,
                headers = ['Tokens', 'Token IDs'],
                tablefmt = 'fancy_grid'))

print_rand_sentence()

## Pre-processing input
<p>BART is pre-trained with more noising schemes than BERT.</p>
<p>The noising schemes include:</p>
<img src = 'https://miro.medium.com/v2/resize:fit:828/format:webp/1*SxfY1s5AgxyXAoA8k0JqSA.png'>
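
As a rough illustration, the five noising transformations described in the BART paper (token masking, token deletion, text infilling, sentence permutation, document rotation) can be sketched on toy token lists. This is a simplified sketch, not the pre-training code: in actual pre-training the positions and span lengths are sampled at random, whereas the hypothetical helpers below take them as explicit arguments so the result is reproducible.

```python
MASK = "<mask>"

def token_masking(tokens, positions):
    """Replace the tokens at the given positions with a mask token."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def token_deletion(tokens, positions):
    """Remove the tokens at the given positions entirely."""
    return [t for i, t in enumerate(tokens) if i not in positions]

def text_infilling(tokens, start, length):
    """Replace a span of `length` tokens with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, order):
    """Reorder whole sentences according to `order`."""
    return [sentences[i] for i in order]

def document_rotation(tokens, start):
    """Rotate the document so that it begins at position `start`."""
    return tokens[start:] + tokens[:start]

tokens = ["A", "B", "C", "D", "E"]
print(token_masking(tokens, {1, 3}))
print(token_deletion(tokens, {0, 3}))
print(text_infilling(tokens, 1, 2))
print(sentence_permutation(["S1", "S2", "S3"], [2, 0, 1]))
print(document_rotation(tokens, 2))
```

Fine-tuning for classification does not apply any of these transformations; they only describe how the pre-trained weights were obtained.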


In [None]:
token_id = []
attention_masks = []

def preprocessing(input_text, tokenizer):
  '''
  Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - attention_mask: list of 0/1 flags specifying which tokens should be considered by the model (return_attention_mask = True).
  The BART tokenizer does not return token_type_ids.
  '''
  return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        max_length = 512,
                        padding='max_length',
                        truncation = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )

for sample in text:
  encoding_dict = preprocessing(sample, tokenizer)
  token_id.append(encoding_dict['input_ids'])
  attention_masks.append(encoding_dict['attention_mask'])


token_id = torch.cat(token_id, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)
labels = torch.tensor(labels)
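
To make the effect of `padding='max_length'` and `truncation=True` more concrete, here is a toy sketch in plain Python. The helper `pad_and_mask` is hypothetical, and the IDs 100 and 200 are made-up values; only the special IDs (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`) follow the `facebook/bart-base` vocabulary. The real tokenizer also keeps the closing `</s>` token when it truncates, which this simplified version does not.

```python
def pad_and_mask(ids, max_length, pad_id):
    """Truncate or right-pad ids to max_length and build the matching attention mask."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    padded = ids + [pad_id] * n_pad
    mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask

# Short input: padded up to max_length
input_ids, attention_mask = pad_and_mask([0, 100, 200, 2], max_length=6, pad_id=1)
print(input_ids)
print(attention_mask)

# Long input: cut down to max_length, no padding needed
long_ids, long_mask = pad_and_mask(list(range(10)), max_length=6, pad_id=1)
print(long_ids)
print(long_mask)
```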

We can also verify the output of tokenizer.encode_plus by inspecting tokens, their IDs and the attention mask for random text samples as follows:

In [None]:
print(token_id[0])
print(token_id[1])
print(token_id[2])
print(token_id[10].size())

def print_rand_sentence_encoding():
  '''Displays tokens, token IDs and attention mask of a random text sample'''
  index = random.randint(0, len(text) - 1)
  # Map IDs straight back to tokens so the three columns always have the same length
  tokens = tokenizer.convert_ids_to_tokens(token_id[index].tolist())
  token_ids = token_id[index].tolist()
  attention = attention_masks[index].tolist()

  table = np.array([tokens, token_ids, attention]).T
  print(tabulate(table,
                 headers = ['Tokens', 'Token IDs', 'Attention Mask'],
                 tablefmt = 'fancy_grid'))

print_rand_sentence_encoding()

## Split dataset
Separate the dataset into an 80% training set and a 20% validation set.

In [None]:
val_ratio = 0.2
# Recommended batch size: 16, 32. See: https://arxiv.org/pdf/1810.04805.pdf
batch_size = 16

# Indices of the train and validation splits stratified by labels
train_idx, val_idx = train_test_split(
    np.arange(len(labels)),
    test_size = val_ratio,
    random_state=0,
    stratify = labels)

# Train and validation sets
train_set = TensorDataset(token_id[train_idx],
                          attention_masks[train_idx],
                          labels[train_idx])

val_set = TensorDataset(token_id[val_idx],
                        attention_masks[val_idx],
                        labels[val_idx])

# Prepare DataLoader
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )

validation_dataloader = DataLoader(
            val_set,
            sampler = SequentialSampler(val_set),
            batch_size = batch_size
        )

## Fine-tune Model functions

In [None]:
def b_tp(preds, labels):
  '''Returns True Positives (TP): count of class-1 samples correctly predicted as class 1'''
  return sum([p == y and p == 1 for p, y in zip(preds, labels)])

def b_fp(preds, labels):
  '''Returns False Positives (FP): count of class-0 samples wrongly predicted as class 1'''
  return sum([p != y and p == 1 for p, y in zip(preds, labels)])

def b_tn(preds, labels):
  '''Returns True Negatives (TN): count of class-0 samples correctly predicted as class 0'''
  return sum([p == y and p == 0 for p, y in zip(preds, labels)])

def b_fn(preds, labels):
  '''Returns False Negatives (FN): count of class-1 samples wrongly predicted as class 0'''
  return sum([p != y and p == 0 for p, y in zip(preds, labels)])

def b_metrics(preds, labels):
  '''
  Returns the following metrics:
    - accuracy    = (TP + TN) / N
    - precision   = TP / (TP + FP)
    - recall      = TP / (TP + FN)
    - specificity = TN / (TN + FP)
  '''
  preds = np.argmax(preds, axis = 1).flatten()
  labels = labels.flatten()
  tp = b_tp(preds, labels)
  tn = b_tn(preds, labels)
  fp = b_fp(preds, labels)
  fn = b_fn(preds, labels)
  b_accuracy = (tp + tn) / len(labels)
  b_precision = tp / (tp + fp) if (tp + fp) > 0 else 'nan'
  b_recall = tp / (tp + fn) if (tp + fn) > 0 else 'nan'
  b_specificity = tn / (tn + fp) if (tn + fp) > 0 else 'nan'
  return b_accuracy, b_precision, b_recall, b_specificity
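
A small worked example may help to check the four formulas by hand. The sketch below uses made-up predictions and labels and a hypothetical helper `confusion_counts`; it is independent of the functions above and only mirrors their definitions.

```python
def confusion_counts(preds, labels):
    """Count TP, FP, TN, FN for binary predictions (1 = positive, 0 = negative)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp, fp, tn, fn

# Made-up predictions and ground-truth labels for six reviews
preds  = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(preds, labels)
accuracy    = (tp + tn) / len(labels)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)

print(tp, fp, tn, fn)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(specificity, 4))
```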

## Load BART model

In [None]:
# Load the BartForSequenceClassification model
model = BartForSequenceClassification.from_pretrained(
    'facebook/bart-base',
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)
#summary(model,input_size=(1,32), dtypes=['torch.IntTensor']) 

## Set optimizer

In [None]:
# Recommended learning rates (Adam): 5e-5, 3e-5, 2e-5. See: https://arxiv.org/pdf/1810.04805.pdf
optimizer = torch.optim.AdamW(model.parameters(),
                              lr = 2e-5,
                              eps = 1e-08
                              )

## Check run environment

In [None]:
# Run on GPU when one is available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

Create a folder to save the models:

In [None]:
!mkdir -p Model

## Train the model and calculate accuracy

In [None]:
epochs = 2

for loop in trange(epochs, desc = 'Epoch'):

    # ========== Training ==========

    # Set model to training mode
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        # Forward pass
        train_output = model(b_input_ids,
                             attention_mask = b_input_mask,
                             labels = b_labels)
        # Backward pass
        train_output.loss.backward()
        optimizer.step()
        # Update tracking variables
        tr_loss += train_output.loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    # ========== Validation ==========

    # Set model to evaluation mode
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_precision = []
    val_recall = []
    val_specificity = []

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
          # Forward pass
          eval_output = model(b_input_ids,
                              attention_mask = b_input_mask)
        logits = eval_output.logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Calculate validation metrics
        b_accuracy, b_precision, b_recall, b_specificity = b_metrics(logits, label_ids)
        val_accuracy.append(b_accuracy)
        # Update precision only when (tp + fp) !=0; ignore nan
        if b_precision != 'nan': val_precision.append(b_precision)
        # Update recall only when (tp + fn) !=0; ignore nan
        if b_recall != 'nan': val_recall.append(b_recall)
        # Update specificity only when (tn + fp) !=0; ignore nan
        if b_specificity != 'nan': val_specificity.append(b_specificity)
    
    # Save the model at the end of each epoch
    with open("Model/BartforSC_model_{}.pth".format(loop), "wb") as f:
        model.eval()
        pickle.dump(model, f)
        model.train()

    print('\n\t - Train loss: {:.4f}'.format(tr_loss / nb_tr_steps))
    print('\t - Validation Accuracy: {:.4f}'.format(sum(val_accuracy)/len(val_accuracy)))
    print('\t - Validation Precision: {:.4f}'.format(sum(val_precision)/len(val_precision)) if len(val_precision)>0 else '\t - Validation Precision: NaN')
    print('\t - Validation Recall: {:.4f}'.format(sum(val_recall)/len(val_recall)) if len(val_recall)>0 else '\t - Validation Recall: NaN')
    print('\t - Validation Specificity: {:.4f}\n'.format(sum(val_specificity)/len(val_specificity)) if len(val_specificity)>0 else '\t - Validation Specificity: NaN')
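
One caveat about the loop above: it averages precision, recall and specificity over batches, which is not always the same as computing them once over the whole validation set. The sketch below, with made-up counts for two batches, shows how the two can differ; pooling the raw counts is one possible alternative, not what the notebook currently does.

```python
# Made-up true-positive / false-positive counts for two validation batches
batches = [
    {"tp": 1, "fp": 0},
    {"tp": 1, "fp": 3},
]

# Approach used in the loop: precision per batch, then the mean over batches
per_batch = [b["tp"] / (b["tp"] + b["fp"]) for b in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Alternative: pool the counts first, then compute precision once
total_tp = sum(b["tp"] for b in batches)
total_fp = sum(b["fp"] for b in batches)
pooled = total_tp / (total_tp + total_fp)

print(mean_of_batches)
print(pooled)
```

Accuracy is affected much less, because every batch except possibly the last has the same size.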


## Performance showcase

Test the model on new audience reviews of Dune: Part Two, taken from Rotten Tomatoes.

In [None]:
# New audience reviews of Dune: Part Two, taken from Rotten Tomatoes
review = ["Very dark. I love the first one, and was so excited for part two but it just seem to be super dark to me.",
          "Not the type of movie that I was expecting for my first date. Not the kinda movie I would watch on a first date.",
          "Long, boring, music was oppressively loud .",
          "Long movie at 2 hours and 46 minutes but it moved very fast. Keeps you interested the whole way. Great visuals, story, and sound. Best movie I have seen in a while. Go see it. It wont disappoint you.",
          "Very close to the book, which is a long in-depth story. Dramatic scenes and very real characters portrayed throughout."]

# Index of the saved checkpoint to load (files are numbered from 0)
best_epoch = 1

# Load the saved model once, outside the loop, and switch to evaluation mode
with open(f"Model/BartforSC_model_{best_epoch}.pth", "rb") as f:
    model = pickle.load(f)
model.to(device)
model.eval()

for new_sentence in review:
  # Apply the tokenizer; the encoding already has a batch dimension of 1
  encoding = preprocessing(new_sentence, tokenizer)

  # Extract IDs and Attention Mask
  test_ids = encoding['input_ids']
  test_attention_mask = encoding['attention_mask']

  # Forward pass, calculate logit predictions
  with torch.no_grad():
    output = model(test_ids.to(device), attention_mask = test_attention_mask.to(device))

  prediction = 'positive' if np.argmax(output.logits.cpu().numpy()).flatten().item() == 1 else 'negative'

  print('Input Sentence: ', new_sentence)
  print('Predicted Class: ', prediction)
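
The cell above reports only the predicted class. If a confidence value is also wanted, the two logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits; the values are illustrative and are not output from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order [negative, positive]
logits = [-1.0, 2.0]
probs = softmax(logits)
label = 'positive' if probs[1] > probs[0] else 'negative'

print(label, round(probs[1], 4))
```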

## Summary
<p>In this sentiment analysis task, I use the <b>BART model</b> rather than the BERT model, and obtain a validation accuracy of <b>95.22%</b> on the IMDB dataset.</p>
<p>Because the reviews in the IMDB dataset are longer than the sentences in the dataset used by the example, I changed the maximum token length to 512. A larger maximum length means fewer reviews are truncated, so more of each review's content reaches the model.</p>
<p>The batch size is 16, one of the recommended values. A small batch size can improve generalization performance.</p>
<p>The learning rate is set to 2e-5, which is one of the recommended values and gave the best result among the learning rates I tried. The eps value is unchanged.</p>
<p>With these settings I trained for 2 epochs, and the first epoch already gave the best result: 95.22% accuracy.</p>
<p>Possibly because of the additional noising used in pre-training and the full encoder-decoder structure, the performance is even higher than BERT's: I obtained 94.14% accuracy with the BERT model, while BART reaches 95.22%. The encoder parts of BART and BERT are quite similar, so I could reuse most of the example code and still obtain higher performance, which is one reason why I chose the BART model.</p>