<a href="https://colab.research.google.com/github/pedrofuentes79/RNNs/blob/master/Sentiment-Analysis/sentAnalysis_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook I will use a transformer-based model to solve the same problem as in the other notebooks: determining whether a review is positive or negative.

The objective is to fine-tune a pre-trained BERT model for sequence classification. Since BERT is large and expensive to train, I will use only 1000 training examples to adapt it to my specific task. However, the validation and test sets will be the same as in the other notebooks, because I use the same dataset and random state throughout. This ensures a fair comparison, given that all models are evaluated on the same sets.   
Again, the data preprocessing steps are very similar to those in the [main notebook](https://github.com/pedrofuentes79/RNNs/blob/master/Sentiment-Analysis/sentAnalysis_Main_(RNN).ipynb), so I will not explain them again. However, since this model uses PyTorch rather than TensorFlow, some things are quite different.

# Imports and dataset

In [None]:
!pip install --upgrade transformers;
!pip install transformers[torch];
!pip install accelerate -U;
!pip install pytorch-lightning;

In [2]:
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import torch
from torch.utils.data import IterableDataset, TensorDataset, DataLoader
import matplotlib.pyplot as plt
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
from google.colab import drive
drive.mount("/content/drive")

# Draw the same sample as the other notebooks (same size and random state) for reproducibility.
# The training data is later cut down to fewer rows, since this is only fine-tuning.
# The validation and test data stay the same as for the other models.
SAMPLE_SIZE = 50000

df = pd.read_csv('/content/drive/MyDrive/ColabProjects/amazonreviews_relevantcolumns.csv')

df = df.sample(n=SAMPLE_SIZE, random_state=27)


Mounted at /content/drive


In [4]:
# Constants
MAXLEN = 128

In [5]:
text = df["Text"]
labels = df["Score"].map({1:0, 2:0, 3:0, 4:1, 5:1}).astype("int32").to_numpy()

labels = torch.tensor(labels)

# Tokenize
This model requires its own tokenizer, which replaces the Word2Vec and embedding layers used in the other notebooks. The tokenizer is imported from the <b>transformers</b> library.  
After tokenizing, the input ids and attention masks are extracted from the encoding so that they can be passed to the model as separate arguments.

In [6]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer.batch_encode_plus(text, max_length=MAXLEN, padding="max_length", truncation=True, return_attention_mask=True, return_tensors="pt")

input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]



In [7]:
# Sanity check shapes
print(input_ids.size())
print(attention_mask.size())

torch.Size([50000, 128])
torch.Size([50000, 128])
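
Both tensors have 128 columns because every review is truncated or padded to MAXLEN. As a rough illustration of what that means for a single review, here is a minimal pure-Python sketch. The helper `pad_and_mask` and the token ids are hypothetical, and a pad id of 0 is an assumption; the real tokenizer also inserts special tokens, which this sketch does not model.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids to max_len and build its attention mask."""
    kept = token_ids[:max_len]                       # truncation: drop anything past max_len
    n_pad = max_len - len(kept)                      # how many padding positions are needed
    input_ids = kept + [pad_id] * n_pad              # padded ids, always max_len long
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative ids only, not real tokenizer output.
ids, mask = pad_and_mask([11, 12, 13, 14], max_len=6)
print(ids)   # [11, 12, 13, 14, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside the input ids.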


# Data split
Here, I do the same split as in the other notebooks (70% training, 15% validation, 15% test). The only addition is in the last lines, where I reduce the training set to its first 1000 rows because of computational limitations.

In [8]:
# Perform data splitting
train_inputs, temp_inputs, train_attention_masks, temp_attention_masks, train_labels, temp_labels = train_test_split(input_ids, attention_mask, labels, test_size=0.3, random_state=27)
val_inputs, test_inputs, val_attention_masks, test_attention_masks, val_labels, test_labels = train_test_split(temp_inputs, temp_attention_masks, temp_labels, test_size=0.5, random_state=27)

# Only get 1000 rows for the training data
train_inputs = train_inputs[:1000]
train_attention_masks = train_attention_masks[:1000]
train_labels = train_labels[:1000]
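
The sizes that result from the two calls above can be checked with simple arithmetic. This is a minimal sketch with a hypothetical helper `split_sizes`; it uses integer arithmetic and assumes the fractions divide evenly, as they do for 50,000 rows, so it does not reproduce how `train_test_split` rounds in other cases.

```python
def split_sizes(n_rows, holdout_pct=30):
    """Sizes of a train / validation / test split where the holdout is halved."""
    n_holdout = n_rows * holdout_pct // 100   # first split: test_size=0.3
    n_train = n_rows - n_holdout
    n_test = n_holdout // 2                   # second split: test_size=0.5
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(50_000)
print(n_train, n_val, n_test)  # 35000 7500 7500
print(min(n_train, 1000))      # 1000 rows are kept for fine-tuning
```

Only the training portion is reduced, so the validation and test sets keep their full 7,500 rows each.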

Here, the tensors are wrapped in the TensorDataset format (from PyTorch) so that the DataLoaders can batch them and feed them into the model.

In [9]:
# Create TensorDataset instances
train_dataset = TensorDataset(train_inputs, train_attention_masks, train_labels)
val_dataset = TensorDataset(val_inputs, val_attention_masks, val_labels)
test_dataset = TensorDataset(test_inputs, test_attention_masks, test_labels)

In [10]:
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
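
With a batch size of 32, the number of batches per epoch follows directly from the dataset sizes. The sketch below uses a hypothetical helper `batches_per_epoch` and assumes the loaders keep the last partial batch (the DataLoader default) and that the validation set has 7,500 rows (15% of 50,000).

```python
import math


def batches_per_epoch(n_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)


train_batches = batches_per_epoch(1000, 32)
print(train_batches)                 # 32 (31 full batches plus one batch of 8)
print(train_batches * 3)             # 96 optimizer steps over 3 epochs
print(batches_per_epoch(7500, 32))   # 235 validation batches
```

The 96 training steps agree with the checkpoint name `epoch=2-step=96.ckpt` that appears in the logs further down.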

# Model
Here, I import the model from the transformers library (Hugging Face). num_labels is set to 1 so that the classification head outputs a single logit per review, which is later passed through a sigmoid to obtain the probability that the review is positive.

In [11]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
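
With a single output unit, each review yields one logit, which becomes a probability through the sigmoid and then a class through a threshold. The following pure-Python sketch shows that chain for one value; the function names are hypothetical, and the 0.5 threshold mirrors the one used later in the validation and test steps.

```python
import math


def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def predict(logit, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    prob = sigmoid(logit)
    return prob, int(prob >= threshold)


print(predict(0.0))       # (0.5, 1): a logit of 0 sits exactly on the threshold
print(predict(-3.0)[1])   # 0: negative logits map to the negative class
print(predict(3.0)[1])    # 1: positive logits map to the positive class
```

In other words, thresholding the probability at 0.5 is the same as checking the sign of the logit.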


## Training

This is the custom SentimentClassifier class, which inherits from PyTorch Lightning's LightningModule. It lets me customize the training, validation and test steps, and choose exactly which metrics are logged and how they are computed.
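
The training and validation steps compute binary cross-entropy on the sigmoid outputs. As a minimal sketch of that loss for a single example, the pure-Python code below compares the probability-based form with an algebraically equivalent form computed straight from the logit. The second form is an alternative, not what this notebook does; it is the usual way to avoid taking the log of a probability that has rounded to 0 or 1. Both function names are hypothetical.

```python
import math


def bce_from_prob(p, y):
    """Binary cross-entropy for one example, given a probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """The same loss computed from the logit z, without forming the probability first."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


z, y = 1.5, 1
p = 1.0 / (1.0 + math.exp(-z))   # sigmoid of the logit
print(math.isclose(bce_from_prob(p, y), bce_from_logit(z, y)))  # True
```

For a whole batch, the loss is the mean of these per-example values.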

In [12]:
import pytorch_lightning as pl

# Define your model as a PyTorch Lightning module
class SentimentClassifier(pl.LightningModule):
    def __init__(self, model, learning_rate=2e-5):
        super(SentimentClassifier, self).__init__()
        self.model = model
        self.learning_rate = learning_rate

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.model(input_ids, attention_mask=attention_mask, labels=labels)
        logits = outputs.logits
        probs = torch.sigmoid(logits)
        return probs

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        # get the probs and then flatten them to be fed into the loss function.
        # otherwise it will cause shape conflicts, since the labels are flattened.
        probs = self(input_ids, attention_mask).view(-1)
        loss = torch.nn.BCELoss()(probs, labels.float())

        self.log('train_loss', loss)
        return loss



    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch

        probs = self(input_ids, attention_mask).view(-1)
        loss = torch.nn.BCELoss()(probs, labels.float())

        # Convert probabilities to binary predictions (0 or 1) based on a threshold
        preds = (probs >= 0.5).long()

        correct = (preds == labels).sum().item()
        total = labels.size(0)

        self.log('val_loss', loss)
        self.log('val_accuracy', correct / total, prog_bar=True)


    def test_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        probs = self(input_ids, attention_mask)

        # Convert probabilities to binary predictions (0 or 1) based on a threshold
        preds = (probs >= 0.5).long().view(-1)

        correct = (preds == labels).sum().item()
        total = labels.size(0)

        self.log('test_accuracy', correct / total)

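        # precision (clamp the denominator to avoid 0/0 when a batch has no positive predictions)
        true_positives = (preds == 1) & (labels == 1)
        all_positives = preds == 1
        precision = true_positives.sum() / all_positives.sum().clamp(min=1)
        self.log('test_precision', precision)

        # recall (same guard for batches that contain no positive labels)
        false_negatives = (preds == 0) & (labels == 1)
        recall = true_positives.sum() / (true_positives.sum() + false_negatives.sum()).clamp(min=1)
        self.log("test_recall", recall)

        # f1 score (evaluates to 0 when precision and recall are both 0)
        f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)
        self.log("test_f1_score", f1)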

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer



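# Define the training arguments
# NOTE: these Hugging Face TrainingArguments are never passed to a Trainer.
# The run below uses pl.Trainer(max_epochs=3) and the module's default learning_rate=2e-5.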
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # Evaluate after each epoch
    save_total_limit=2,  # Limit the number of checkpoints saved
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=3e-5,
)

classifier = SentimentClassifier(model)



# Instantiate a PyTorch Lightning Trainer
trainer = pl.Trainer(max_epochs=3)

# Train the model
trainer.fit(classifier, train_loader, val_loader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                          | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 109 M 
--------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
437.932   Total estimated model params size (MB)



INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


## Validation and testing

In [13]:
val_results = trainer.validate(dataloaders=val_loader)

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_0/checkpoints/epoch=2-step=96.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from the checkpoint at /content/lightning_logs/version_0/checkpoints/epoch=2-step=96.ckpt



In [14]:
test_results = trainer.test(dataloaders=test_loader)

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_0/checkpoints/epoch=2-step=96.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from the checkpoint at /content/lightning_logs/version_0/checkpoints/epoch=2-step=96.ckpt
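
One caveat about the test metrics: precision, recall and F1 are computed inside test_step, so they are per-batch values, and as far as I understand Lightning then averages the logged values over the epoch. The mean of per-batch F1 scores is generally not the F1 score of the whole test set. The pure-Python sketch below illustrates the difference on two made-up batches; the helper names and the data are hypothetical.

```python
def confusion_counts(preds, labels):
    """Count true positives, false positives and false negatives for the positive class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from counts, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two made-up batches of (predictions, labels).
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 1]),
    ([1, 1, 1, 1], [1, 1, 1, 1]),
]

# Option A: compute F1 per batch, then average (what per-batch logging amounts to).
per_batch_f1 = [precision_recall_f1(*confusion_counts(p, y))[2] for p, y in batches]
mean_of_batch_f1 = sum(per_batch_f1) / len(per_batch_f1)

# Option B: accumulate the counts over all batches, then compute F1 once.
totals = [sum(c) for c in zip(*(confusion_counts(p, y) for p, y in batches))]
global_f1 = precision_recall_f1(*totals)[2]

print(mean_of_batch_f1)      # 0.75
print(round(global_f1, 4))   # 0.8333
```

Accumulating the counts across batches and computing the metrics once at the end of the test epoch would give the dataset-level figures.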

