<a href="https://colab.research.google.com/github/pedrofuentes79/RNNs/blob/master/Sentiment-Analysis/sentAnalysis_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook I will use a transformer-based model to solve the same problem as in the other notebooks: determining whether a review is positive or negative.

The objective is to fine-tune a pre-trained BERT model for sequence classification. Since the BERT model is very large and complex, I will use only a small subset of the training data to adapt it to my specific task (the code below keeps the first 5000 rows). However, the validation and test sets will be the same as in the other notebooks, because I use the same dataset and random state throughout. This ensures a fair comparison, given that all models are evaluated on the same sets.   
Again, the data preprocessing steps are very similar to the [main notebook](https://github.com/pedrofuentes79/RNNs/blob/master/Sentiment-Analysis/sentAnalysis_Main_(RNN).ipynb), so I will not explain them again. However, since this model uses PyTorch, some things are quite different.

# Imports and dataset

In [None]:
!pip install --upgrade transformers;
!pip install transformers[torch];
!pip install accelerate -U;
!pip install pytorch-lightning;



In [None]:
!pip install datasets


In [None]:
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import torch
from torch.utils.data import IterableDataset, TensorDataset, DataLoader
import matplotlib.pyplot as plt
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from google.colab import drive
drive.mount("/content/drive")

# Imitate the same dataset as the other models for reproducibility
# Then I will limit training data to fewer rows since this is fine tuning
# However, I will use the same data to validate and test the model
SAMPLE_SIZE = 50000

df = pd.read_csv('/content/drive/MyDrive/ColabProjects/amazonreviews_relevantcolumns.csv')

df = df.sample(n=SAMPLE_SIZE, random_state=27)


Mounted at /content/drive


In [None]:
# Constants
MAXLEN = 128

In [None]:
text = df["Text"]
labels = df["Score"].map({1:0, 2:0, 3:0, 4:1, 5:1}).astype("int32").to_numpy()

labels = torch.tensor(labels).view(-1, 1)

# Tokenize
This model requires its own tokenizer, which matches the vocabulary BERT was pre-trained on and differs from the Word2Vec and embedding-layer approach used in the other notebooks. The tokenizer is imported from the <b>transformers</b> library.  
After tokenizing, the input IDs and attention masks are stored separately, since the model takes them as two separate arguments.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer.batch_encode_plus(text, max_length=MAXLEN, padding="max_length", truncation=True, return_attention_mask=True, return_tensors="pt")

input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]



In [None]:
# Sanity check shapes
print(input_ids.size())
print(attention_mask.size())
print(labels.size())

torch.Size([50000, 128])
torch.Size([50000, 128])
torch.Size([50000, 1])
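
To make the two tensors above more concrete, here is a minimal sketch of how padding and the attention mask relate. It does not use the real BERT tokenizer: the vocabulary, the token IDs, `TOY_MAXLEN` and the `toy_encode` helper are all made up for illustration.

```python
# Toy illustration of padding and attention masks (NOT the real BERT tokenizer).
# The vocabulary and IDs below are hypothetical and chosen only for this example.
PAD_ID = 0
TOY_MAXLEN = 6

toy_vocab = {"[CLS]": 1, "[SEP]": 2, "great": 3, "product": 4, "not": 5, "good": 6}


def toy_encode(words, max_length=TOY_MAXLEN):
    """Return (ids, mask), both exactly max_length long."""
    ids = [toy_vocab["[CLS]"]] + [toy_vocab[w] for w in words]
    # Truncate if needed, leaving room for the closing [SEP] token.
    ids = ids[: max_length - 1] + [toy_vocab["[SEP]"]]
    mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    # 0 in the mask marks padding positions the model should ignore.
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


toy_ids, toy_mask = toy_encode(["great", "product"])
print(toy_ids)   # [1, 3, 4, 2, 0, 0]
print(toy_mask)  # [1, 1, 1, 1, 0, 0]
```

Every row ends up with the same length, which is why both `input_ids` and `attention_mask` have shape `[50000, 128]` in the sanity check.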


# Data split
Here I use the same split as in most notebooks (70% training, 15% validation, 15% test), with the addition of the last lines, which cut the training set down to its first 5000 rows due to computational limitations.

In [None]:
# Perform data splitting
train_inputs, temp_inputs, train_attention_masks, temp_attention_masks, train_labels, temp_labels = train_test_split(input_ids, attention_mask, labels, test_size=0.3, random_state=27)
val_inputs, test_inputs, val_attention_masks, test_attention_masks, val_labels, test_labels = train_test_split(temp_inputs, temp_attention_masks, temp_labels, test_size=0.5, random_state=27)

# Only keep the first 5000 rows of the training data
train_inputs = train_inputs[:5000]
train_attention_masks = train_attention_masks[:5000]
train_labels = train_labels[:5000]
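
The resulting set sizes can be checked with a small stand-alone sketch. The `shuffle_split` helper below is a simplified, hypothetical stand-in for `train_test_split`: it gives the same sizes for these numbers but not the same row order as scikit-learn.

```python
import random


def shuffle_split(items, test_size, seed):
    """Simplified stand-in for sklearn's train_test_split (same sizes here, different shuffle)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_size)
    return items[n_test:], items[:n_test]


toy_rows = range(50_000)  # one index per sampled review
toy_train, toy_temp = shuffle_split(toy_rows, test_size=0.3, seed=27)
toy_val, toy_test = shuffle_split(toy_temp, test_size=0.5, seed=27)
toy_train = toy_train[:5000]  # same truncation as in the cell above

print(len(toy_train), len(toy_val), len(toy_test))  # 5000 7500 7500
```

Only the training set is truncated; the validation and test sets keep all 7500 rows each.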

The datasets have to be wrapped in PyTorch's TensorDataset format so that a DataLoader can feed them to the model in batches.

In [None]:
train_dataset = TensorDataset(train_inputs, train_attention_masks, train_labels)
val_dataset = TensorDataset(val_inputs, val_attention_masks, val_labels)
test_dataset = TensorDataset(test_inputs, test_attention_masks, test_labels)

In [None]:
from datasets import Dataset

train_dataset_huggingface = Dataset.from_dict({
    'input_ids': train_inputs,
    'attention_mask': train_attention_masks,
    'label': train_labels
})

val_dataset_huggingface = Dataset.from_dict({
    'input_ids': val_inputs,
    'attention_mask': val_attention_masks,
    'label': val_labels
})

test_dataset_huggingface = Dataset.from_dict({
    'input_ids': test_inputs,
    'attention_mask': test_attention_masks,
    'label': test_labels
})

In [None]:
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Model
Here, I import the model from the transformers library (Hugging Face). num_labels is set to 1 so that the classification head has a single output unit: one raw score (logit) per review, which needs a sigmoid to become the probability that the review is positive.

In [None]:
from transformers import BertForSequenceClassification

bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
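
With a single output unit, the head produces one raw score (logit) per review. `BCELoss` expects a probability between 0 and 1, so the logit has to go through a sigmoid first. The sketch below works through that arithmetic in plain Python for one made-up logit; the function names are illustrative, not part of any library.

```python
import math


def sigmoid(logit):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(prob, label):
    """Binary cross-entropy for one example (label is 0 or 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


logit = 2.0  # made-up model output for one review
prob = sigmoid(logit)

print(round(prob, 4))                           # 0.8808
print(round(binary_cross_entropy(prob, 1), 4))  # 0.1269 (confident and correct)
print(round(binary_cross_entropy(prob, 0), 4))  # 2.1269 (confident and wrong)
```

The loss is small when the predicted probability agrees with the label and grows quickly when a confident prediction is wrong.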


## Training

This is the custom SentimentClassifier class, which inherits from PyTorch Lightning's LightningModule. It makes it possible to customize the training, validation and test steps, and to choose exactly which metrics are logged and how they are computed.

In [None]:
import pytorch_lightning as pl

class SentimentClassifier(pl.LightningModule):
    def __init__(self, learning_rate=2e-5):
        super(SentimentClassifier, self).__init__()
        self.model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
        self.learning_rate = learning_rate
        self.loss_fn = torch.nn.BCELoss()

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

        # The classification head returns raw logits of shape (batch, 1).
        # BCELoss expects probabilities, so apply a sigmoid and flatten to (batch,).
        return torch.sigmoid(outputs.logits).view(-1)

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        # Flatten the labels from (batch, 1) to (batch,) so that they match the
        # flattened probabilities; otherwise BCELoss raises a shape error.
        labels = labels.view(-1)
        probs = self(input_ids, attention_mask)
        loss = self.loss_fn(probs, labels.float())

        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        labels = labels.view(-1)

        probs = self(input_ids, attention_mask)
        loss = self.loss_fn(probs, labels.float())

        # Convert probabilities to binary predictions (0 or 1) based on a threshold
        preds = (probs >= 0.5).long()

        correct = (preds == labels).sum().item()
        total = labels.size(0)

        self.log('val_loss', loss)
        self.log('val_accuracy', correct / total, prog_bar=True)

    def test_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        # Flatten the labels so that comparisons with preds are element-wise
        # instead of broadcasting (batch,) against (batch, 1).
        labels = labels.view(-1)
        probs = self(input_ids, attention_mask)

        # Convert probabilities to binary predictions (0 or 1) based on a threshold
        preds = (probs >= 0.5).long()

        # accuracy
        correct = (preds == labels).sum().item()
        total = labels.size(0)
        self.log('test_accuracy', correct / total)


        TP = ((preds == 1) & (labels == 1)).sum()
        TN = ((preds == 0) & (labels == 0)).sum()
        FN = ((preds == 0) & (labels == 1)).sum()
        FP = ((preds == 1) & (labels == 0)).sum()

        # all positive and negative predictions, not the actual labels
        AP = (preds == 1).sum()
        AN = (preds == 0).sum()

        # precision for positive and negative sentiment
        precision_1 = TP / AP if AP != 0 else 0
        precision_0 = TN / AN if AN != 0 else 0

        # recall for positive and negative sentiment
        recall_1 = TP / (TP + FN) if (TP + FN != 0) else 0
        recall_0 = TN / (TN + FP) if (TN + FP != 0) else 0

        # f1 for positive and negative sentiment
        f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1) if (precision_1 + recall_1 != 0) else 0
        f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0) if (precision_0 + recall_0 != 0) else 0

        # log the metrics
        self.log('test_precision_1', precision_1)
        self.log('test_precision_0', precision_0)
        self.log('test_recall_1', recall_1)
        self.log('test_recall_0', recall_0)
        self.log('test_f1_1', f1_1)
        self.log('test_f1_0', f1_0)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer



classifier = SentimentClassifier()

# Instantiate and train the model
trainer = pl.Trainer(max_epochs=3)
trainer.fit(classifier, train_loader, val_loader)

## Validation and testing

In [None]:
val_results = trainer.validate(dataloaders=val_loader)

In [None]:
test_results = trainer.test(dataloaders=test_loader)
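
One caveat when reading the test metrics: `test_step` computes precision, recall and F1 per batch, and, assuming Lightning's default behaviour of averaging logged values over the batches of an epoch, the reported numbers are means of per-batch scores. These can differ from scores computed once over the whole test set. The sketch below shows the gap with made-up confusion counts; `precision_recall_f1` is a hypothetical helper, not part of the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up (TP, FP, FN) counts for the positive class in two batches.
batches = [(8, 2, 0), (1, 0, 3)]

# Option 1: compute F1 per batch, then average (what per-batch logging yields).
per_batch_f1 = [precision_recall_f1(*counts)[2] for counts in batches]
mean_of_batches = sum(per_batch_f1) / len(per_batch_f1)

# Option 2: pool the counts over all batches, then compute F1 once.
tp, fp, fn = (sum(column) for column in zip(*batches))
pooled_f1 = precision_recall_f1(tp, fp, fn)[2]

print(round(mean_of_batches, 4))  # 0.6444
print(round(pooled_f1, 4))        # 0.7826
```

If exact dataset-level scores are needed, one option is to collect all predictions and labels and pass them to `classification_report`, which is already imported at the top of the notebook.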