# Sarcasm detection with BERT


## Fine tunning on a combination of datasets


### Importing the dataset


In [1]:
import pandas as pd

In [2]:
# Load the dataset
combined_df_file_path = "../datasets/combined.parquet"
combined_df = pd.read_parquet(combined_df_file_path)

# Display the first few rows of the dataset for a quick overview
combined_df.head()

Unnamed: 0,sentence,is_sarcastic
0,thirtysomething scientists unveil doomsday clo...,1.0
1,dem rep. totally nails why congress is falling...,0.0
2,eat your veggies: 9 deliciously different recipes,0.0
3,inclement weather prevents liar from getting t...,1.0
4,mother comes pretty close to using word 'strea...,1.0


### Some statistics and cleaning


In [3]:
import re

In [4]:
# Checking for any null values in the dataset
combined_df_null_check = combined_df.isnull().sum()

# Data cleaning: removing special characters and escape sequences from the sentences
combined_df["sentence"] = combined_df["sentence"].apply(
    lambda x: re.sub(r"[\n\r\t]+", " ", x)
)

# Checking the distribution of the 'is_sarcastic' column
combined_df_label_distribution = combined_df["is_sarcastic"].value_counts(
    normalize=True
)

combined_df_null_check, combined_df_label_distribution

(sentence        0
 is_sarcastic    0
 dtype: int64,
 is_sarcastic
 0.0    0.521391
 1.0    0.478609
 Name: proportion, dtype: float64)

### Splitting the dataset


In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# Splitting the dataset into training, validation, and testing sets
combined_train_data, combined_test_data = train_test_split(
    combined_df, test_size=0.3, random_state=42
)
combined_val_data, combined_test_data = train_test_split(
    combined_test_data, test_size=0.5, random_state=42
)

# Showing the size of each split
combined_train_size, combined_val_size, combined_test_size = (
    len(combined_train_data),
    len(combined_val_data),
    len(combined_test_data),
)
combined_train_size, combined_val_size, combined_test_size

(28322, 6069, 6070)

### Creating the Dataset class for the BertTokenizer & PyTorch


In [7]:
from transformers import BertTokenizer
from torch.utils.data import Dataset, DataLoader
import torch

  from .autonotebook import tqdm as notebook_tqdm


In [8]:
class SarcasticSentencesDataset(Dataset):
    """
    A custom PyTorch Dataset for the sarcastic sentences dataset.
    """

    def __init__(self, sentences, labels, tokenizer, max_len):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, item):
        sentence = str(self.sentences[item])
        label = self.labels[item]

        # Encoding the sentences using the tokenizer
        encoding = self.tokenizer.encode_plus(
            sentence,
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
            max_length=self.max_len,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",  # Return PyTorch tensors
            truncation=True,
        )

        return {
            "sentence": sentence,
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "labels": torch.tensor(label, dtype=torch.long),
        }

In [9]:
# Initialize the BERT tokenizer
bert_base_uncased_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Constants
COMBINED_MAX_LEN = 128 * 2
COMBINED_BATCH_SIZE = 16

# Creating instances of the SarcasticSentencesDataset
combined_train_dataset = SarcasticSentencesDataset(
    combined_train_data["sentence"].to_numpy(),
    combined_train_data["is_sarcastic"].to_numpy(),
    bert_base_uncased_tokenizer,
    COMBINED_MAX_LEN,
)

combined_val_dataset = SarcasticSentencesDataset(
    combined_val_data["sentence"].to_numpy(),
    combined_val_data["is_sarcastic"].to_numpy(),
    bert_base_uncased_tokenizer,
    COMBINED_MAX_LEN,
)

combined_test_dataset = SarcasticSentencesDataset(
    combined_test_data["sentence"].to_numpy(),
    combined_test_data["is_sarcastic"].to_numpy(),
    bert_base_uncased_tokenizer,
    COMBINED_MAX_LEN,
)

# Creating the DataLoaders for training, validation, and testing
combined_train_loader = DataLoader(
    combined_train_dataset, batch_size=COMBINED_BATCH_SIZE, shuffle=True
)
combined_val_loader = DataLoader(combined_val_dataset, batch_size=COMBINED_BATCH_SIZE)
combined_test_loader = DataLoader(combined_test_dataset, batch_size=COMBINED_BATCH_SIZE)

# Checking the first batch from the train_loader
next(iter(combined_train_loader))

{'sentence': ['ty cobb returns to old private practice in enchanted forest toadstool',
  "who needs the apple watch? this startup is building straps that make any regular watch 'smart'",
  "And they could have experienced more without gun control. You're jumping to conclusions.",
  "If you paid attention the BATFE said so.     Yes but some people don't want to close the borders. They prefer no border controls.",
  'double amputee proves he is capable of anything',
  'Nothing I have stated is incorrect. If I did you would get specific.',
  'koch network spent nearly $400 million in 2015',
  " Well, this is a record I heard daily in the spring days of 1994, but now, let's say it's some of the records I don't want to let any of my friends know I have. After the excellent ULTRAMEGA OK (just the title says all) and  the less excellent Badmotorfinger this must be considered as the complete  sell-out! The only really cool track is Ben Shepherd's &quot;Half&quot;. I  can't help thinking of pop

### Creating the train and validation loops


In [10]:
# Torch imports
import torch
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss, Module
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler, LambdaLR

# Transformers imports
from transformers import (
    BertForSequenceClassification,
    AdamW,
    get_linear_schedule_with_warmup,
)

# Typing imports
from typing import Dict, Optional, List, Union

# Other libraries
from tqdm import tqdm
import numpy as np

In [11]:
device: torch.device = torch.device(
    device="cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

In [12]:
def train_epoch(
    model: Module,
    data_loader: DataLoader,
    optimizer: Optimizer,
    device: torch.device,
    scheduler: Union[_LRScheduler, LambdaLR],
    loss_fn: CrossEntropyLoss,
    n_examples: int,
    feature_keys: Optional[List[str]] = None,  # List of keys if present
) -> Dict[str, float]:
    model.train()

    losses = []
    correct_predictions = torch.Tensor([0]).to(device)

    # For calculating precision and recall
    tp_sarcasm = 0
    tn_non_sarcasm = 0
    fp_sarcasm = 0
    fn_sarcasm = 0

    for batch in tqdm(data_loader, total=len(data_loader)):
        # Process inputs for single/multiple features
        inputs = (
            {key: batch[key].to(device) for key in feature_keys}
            if feature_keys
            else {
                "input_ids": batch["input_ids"].to(device),
                "attention_mask": batch["attention_mask"].to(device),
            }
        )
        labels = batch["labels"].to(device)

        optimizer.zero_grad()

        outputs = model(**inputs)
        logits = outputs.logits

        loss = loss_fn(logits, labels)
        losses.append(loss.item())
        loss.backward()
        optimizer.step()
        scheduler.step()

        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == labels)

        # Update TP, TN, FP, FN counters
        tp_sarcasm += (preds & labels).sum().item()
        tn_non_sarcasm += ((~preds.byte()) & (~labels.byte())).sum().item()
        fp_sarcasm += (preds & (~labels.byte())).sum().item()
        fn_sarcasm += ((~preds.byte()) & labels).sum().item()

    # Calculate precision and recall for sarcasm class
    precision_sarcasm = (
        tp_sarcasm / (tp_sarcasm + fp_sarcasm) if (tp_sarcasm + fp_sarcasm) > 0 else 0
    )
    recall_sarcasm = (
        tp_sarcasm / (tp_sarcasm + fn_sarcasm) if (tp_sarcasm + fn_sarcasm) > 0 else 0
    )

    # Calculate precision and recall for non-sarcasm class
    precision_non_sarcasm = (
        tn_non_sarcasm / (tn_non_sarcasm + fn_sarcasm)
        if (tn_non_sarcasm + fn_sarcasm) > 0
        else 0
    )
    recall_non_sarcasm = (
        tn_non_sarcasm / (tn_non_sarcasm + fp_sarcasm)
        if (tn_non_sarcasm + fp_sarcasm) > 0
        else 0
    )

    return {
        "accuracy": correct_predictions.float().item() / n_examples,
        "precision_sarcasm": precision_sarcasm,
        "recall_sarcasm": recall_sarcasm,
        "precision_non_sarcasm": precision_non_sarcasm,
        "recall_non_sarcasm": recall_non_sarcasm,
        "loss": np.mean(losses),
    }

In [13]:
def eval_model(
    model: Module,
    data_loader: DataLoader,
    device: torch.device,
    loss_fn: CrossEntropyLoss,
    n_examples: int,
    feature_keys: Optional[List[str]] = None,
) -> Dict[str, float]:
    model.eval()

    losses = []
    correct_predictions = torch.Tensor([0]).to(device)

    # Initialize counters for precision and recall
    tp_sarcasm = 0
    tn_non_sarcasm = 0
    fp_sarcasm = 0
    fn_sarcasm = 0

    with torch.no_grad():
        for batch in tqdm(data_loader, total=len(data_loader)):
            # Process inputs for single/multiple features
            inputs = (
                {key: batch[key].to(device) for key in feature_keys}
                if feature_keys
                else {
                    "input_ids": batch["input_ids"].to(device),
                    "attention_mask": batch["attention_mask"].to(device),
                }
            )
            labels = batch["labels"].to(device)

            outputs = model(**inputs)
            logits = outputs.logits

            loss = loss_fn(logits, labels)
            losses.append(loss.item())

            _, preds = torch.max(logits, dim=1)
            correct_predictions += torch.sum(preds == labels)

            # Update TP, TN, FP, FN counters for precision and recall calculations
            tp_sarcasm += (preds & labels).sum().item()
            tn_non_sarcasm += ((~preds.byte()) & (~labels.byte())).sum().item()
            fp_sarcasm += (preds & (~labels.byte())).sum().item()
            fn_sarcasm += ((~preds.byte()) & labels).sum().item()

    # Calculate precision and recall for sarcasm class
    precision_sarcasm = (
        tp_sarcasm / (tp_sarcasm + fp_sarcasm) if (tp_sarcasm + fp_sarcasm) > 0 else 0
    )
    recall_sarcasm = (
        tp_sarcasm / (tp_sarcasm + fn_sarcasm) if (tp_sarcasm + fn_sarcasm) > 0 else 0
    )

    # Calculate precision and recall for non-sarcasm class
    precision_non_sarcasm = (
        tn_non_sarcasm / (tn_non_sarcasm + fn_sarcasm)
        if (tn_non_sarcasm + fn_sarcasm) > 0
        else 0
    )
    recall_non_sarcasm = (
        tn_non_sarcasm / (tn_non_sarcasm + fp_sarcasm)
        if (tn_non_sarcasm + fp_sarcasm) > 0
        else 0
    )

    return {
        "accuracy": correct_predictions.float().item() / n_examples,
        "precision_sarcasm": precision_sarcasm,
        "recall_sarcasm": recall_sarcasm,
        "precision_non_sarcasm": precision_non_sarcasm,
        "recall_non_sarcasm": recall_non_sarcasm,
        "loss": np.mean(losses),
    }

### Training & evaluation of the model


#### Define the train and validation loop


In [14]:
def train_and_evaluate(
    model: Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    optimizer: Optimizer,
    scheduler: Union[_LRScheduler, LambdaLR],
    loss_fn: CrossEntropyLoss,
    device: torch.device,
    num_epochs: int,
    train_dataset_len: int,
    val_dataset_len: int,
    feature_keys: Optional[List[str]] = None,
):
    best_accuracy = 0.0
    best_epoch = 0

    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print("-" * 10)

        # Training phase
        train_output = train_epoch(
            model=model,
            data_loader=train_loader,
            optimizer=optimizer,
            device=device,
            scheduler=scheduler,
            loss_fn=loss_fn,
            n_examples=train_dataset_len,
            feature_keys=feature_keys,
        )

        train_metrics_str = " | ".join(
            f"{metric}: {value:.4f}" for metric, value in train_output.items()
        )
        print(f"Training Metrics: {train_metrics_str}")

        # Validation phase
        val_output = eval_model(
            model=model,
            data_loader=val_loader,
            device=device,
            loss_fn=loss_fn,
            n_examples=val_dataset_len,
            feature_keys=feature_keys,
        )

        val_metrics_str = " | ".join(
            f"{metric}: {value:.4f}" for metric, value in val_output.items()
        )
        print(f"Validation Metrics: {val_metrics_str}")

        # Example: Save the best model based on validation accuracy
        if val_output["accuracy"] > best_accuracy:
            best_accuracy = val_output["accuracy"]
            best_epoch = epoch
            torch.save(model.state_dict(), "best_model.pth")
            print("Saved Best Model!")

        print()

    print(f"Best Validation Accuracy: {best_accuracy:.4f} on Epoch {best_epoch + 1}")

#### Actual training and evaluation


In [43]:
# Hyperparameters
COMBINED_PRETRAINED_MODEL_NAME_OR_PATH = "bert-base-uncased"
COMBINED_NUM_LABELS = 2  # Number of labels in the dataset
COMBINED_HIDDEN_DROPOUT_PROB = 0.3  # Dropout rate
COMBINED_ATTENTION_PROBS_DROPOUT_PROB = 0.3  # Dropout rate in attention heads
COMBINED_NUM_EPOCHS = 3  # Number of epochs
COMBINED_LR = 2e-5  # Learning rate
COMBINED_WEIGHT_DECAY = 0.01  # Weight decay for regularization
COMBINED_NUM_WARMUP_STEPS = 0  # Number of warmup steps for learning rate scheduler

# Load pre-trained model
combined_model = BertForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=COMBINED_PRETRAINED_MODEL_NAME_OR_PATH,
    num_labels=COMBINED_NUM_LABELS,
    hidden_dropout_prob=COMBINED_HIDDEN_DROPOUT_PROB,  # dropout rate,
    attention_probs_dropout_prob=COMBINED_ATTENTION_PROBS_DROPOUT_PROB,  # dropout rate in attention heads
)

# For typing purposes, check if model is an instance of Module
if not isinstance(combined_model, Module):
    raise ValueError("Model must be an instance of Module")

# Send the model to GPU if available
combined_model.to(device=device)  # type: ignore

# Optimizer
combined_optimizer = AdamW(
    combined_model.parameters(), lr=COMBINED_LR, weight_decay=COMBINED_WEIGHT_DECAY
)

# Total number of training steps
combined_total_steps = len(combined_train_loader) * COMBINED_NUM_EPOCHS

# Scheduler for learning rate
combined_scheduler = get_linear_schedule_with_warmup(
    combined_optimizer,
    num_warmup_steps=COMBINED_NUM_WARMUP_STEPS,
    num_training_steps=combined_total_steps,
)

# Loss function
combined_loss_fn = CrossEntropyLoss()

# Feature keys
combined_feature_keys = ["input_ids", "attention_mask"]

# Train and evaluate the model
train_and_evaluate(
    model=combined_model,
    train_loader=combined_train_loader,
    val_loader=combined_val_loader,
    optimizer=combined_optimizer,
    scheduler=combined_scheduler,
    loss_fn=combined_loss_fn,
    device=device,
    num_epochs=COMBINED_NUM_EPOCHS,
    train_dataset_len=len(combined_train_dataset),
    val_dataset_len=len(combined_val_dataset),
    feature_keys=combined_feature_keys,
)

# Save the model
FINE_TUNED_BERT_PATH = "sarcastic_model.pth"
torch.save(combined_model.state_dict(), FINE_TUNED_BERT_PATH)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
----------


100%|██████████| 1771/1771 [22:13<00:00,  1.33it/s]


Training Metrics: accuracy: 0.7895 | precision_sarcasm: 0.7865 | recall_sarcasm: 0.7688 | precision_non_sarcasm: 0.9996 | recall_non_sarcasm: 0.9996 | loss: 0.4281


100%|██████████| 380/380 [01:09<00:00,  5.44it/s]


Validation Metrics: accuracy: 0.8229 | precision_sarcasm: 0.8781 | recall_sarcasm: 0.7352 | precision_non_sarcasm: 0.9995 | recall_non_sarcasm: 0.9998 | loss: 0.3845
Saved Best Model!

Epoch 2/3
----------


100%|██████████| 1771/1771 [16:38<00:00,  1.77it/s]


Training Metrics: accuracy: 0.8686 | precision_sarcasm: 0.8590 | recall_sarcasm: 0.8678 | precision_non_sarcasm: 0.9998 | recall_non_sarcasm: 0.9997 | loss: 0.2979


100%|██████████| 380/380 [00:58<00:00,  6.54it/s]


Validation Metrics: accuracy: 0.8459 | precision_sarcasm: 0.9189 | recall_sarcasm: 0.7468 | precision_non_sarcasm: 0.9995 | recall_non_sarcasm: 0.9999 | loss: 0.3560
Saved Best Model!

Epoch 3/3
----------


100%|██████████| 1771/1771 [16:17<00:00,  1.81it/s]


Training Metrics: accuracy: 0.8938 | precision_sarcasm: 0.8863 | recall_sarcasm: 0.8924 | precision_non_sarcasm: 0.9998 | recall_non_sarcasm: 0.9998 | loss: 0.2469


100%|██████████| 380/380 [00:57<00:00,  6.61it/s]


Validation Metrics: accuracy: 0.8494 | precision_sarcasm: 0.9138 | recall_sarcasm: 0.7597 | precision_non_sarcasm: 0.9995 | recall_non_sarcasm: 0.9999 | loss: 0.3627
Saved Best Model!

Best Validation Accuracy: 0.8494 on Epoch 3


## Create a new model based on the pre-trained BERT model, adding review features


### Create the dataset class for the new model


In [16]:
class SarcasticProductReviewDataset(Dataset):
    """
    A PyTorch Dataset class for sarcastic product reviews with multiple text features.
    """

    def __init__(self, data, tokenizer, max_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        review_data = self.data.iloc[idx]
        label = review_data["is_sarcastic"]

        # Tokenizing each text feature separately
        title_encoding = self.tokenize_text_feature(review_data["title"])
        author_encoding = self.tokenize_text_feature(review_data["author"])
        product_encoding = self.tokenize_text_feature(review_data["product"])
        review_encoding = self.tokenize_text_feature(review_data["review"])

        # Convert stars rating to a tensor
        stars_rating = torch.tensor([float(review_data["stars"])], dtype=torch.float)

        return {
            "title_input_ids": title_encoding["input_ids"].flatten(),
            "title_attention_mask": title_encoding["attention_mask"].flatten(),
            "author_input_ids": author_encoding["input_ids"].flatten(),
            "author_attention_mask": author_encoding["attention_mask"].flatten(),
            "product_input_ids": product_encoding["input_ids"].flatten(),
            "product_attention_mask": product_encoding["attention_mask"].flatten(),
            "review_input_ids": review_encoding["input_ids"].flatten(),
            "review_attention_mask": review_encoding["attention_mask"].flatten(),
            "stars": stars_rating.flatten(),
            "labels": torch.tensor(label, dtype=torch.long),
        }

    def tokenize_text_feature(self, text):
        return self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
            max_length=self.max_len,  # truncate or pad to max_len
            return_token_type_ids=False,
            padding="max_length",  # pad to max_length
            return_attention_mask=True,
            return_tensors="pt",  # return tensors for PyTorch
            truncation=True,
        )

### Create the model class


In [17]:
from torch.nn import Linear, Dropout, ReLU
from transformers import BertModel

In [18]:
class ModelOutput:
    def __init__(self, logits):
        self.logits = logits


class ExtendedBertForMultiFeatureClassification(Module):
    def __init__(
        self,
        pretrained_bert_path,
        fine_tuned_bert_path,
        hidden_size,
        num_labels,
        hidden_dropout_prob,
        attention_probs_dropout_prob,
        classifier_dropout_prob,
    ):
        super().__init__()

        self.bert = BertModel.from_pretrained(
            pretrained_model_name_or_path=pretrained_bert_path,
            num_labels=num_labels,
            hidden_dropout_prob=hidden_dropout_prob,  # dropout rate,
            attention_probs_dropout_prob=attention_probs_dropout_prob,  # dropout rate in attention heads
        )
        if not isinstance(self.bert, Module):
            raise ValueError("Model must be an instance of Module")

        # Load the state_dict from the saved model
        state_dict = torch.load(fine_tuned_bert_path, map_location=torch.device("cpu"))

        # Remove the keys related to the classification head
        state_dict = {
            key: value
            for key, value in state_dict.items()
            if not key.startswith("classifier.")
        }

        # Update the state with the weighed from the fine-tuned model (excluding classifier weights)
        self.bert.load_state_dict(
            state_dict, strict=False
        )  # Set strict to False to ignore missing keys

        # Assuming features for title, author, product, review
        num_bert_features = 4  # how many BERT-encoded text features we're combining
        num_additional_features = 1  # e.g., stars

        # The feature combiner layer to merge BERT-encoded features
        self.feature_combiner = Linear(
            self.bert.config.hidden_size * num_bert_features, hidden_size
        )
        self.feature_combiner_activation = ReLU()

        # The classifier head that includes an additional hidden layer
        self.classifier_hidden = Linear(
            hidden_size + num_additional_features, hidden_size
        )
        self.classifier_hidden_activation = ReLU()

        # Final classification layer
        self.classifier = Linear(hidden_size, num_labels)
        self.dropout = Dropout(classifier_dropout_prob)

    def forward(
        self,
        title_input_ids,
        title_attention_mask,
        author_input_ids,
        author_attention_mask,
        product_input_ids,
        product_attention_mask,
        review_input_ids,
        review_attention_mask,
        stars,
    ):
        if not isinstance(self.bert, Module):
            raise ValueError("Model must be an instance of Module")

        # Process each text input through the fine-tuned BERT independently
        # Extract the last hidden state of the [CLS] token from each output
        title_cls = self.bert(
            title_input_ids, attention_mask=title_attention_mask
        ).pooler_output
        author_cls = self.bert(
            author_input_ids, attention_mask=author_attention_mask
        ).pooler_output
        product_cls = self.bert(
            product_input_ids, attention_mask=product_attention_mask
        ).pooler_output
        review_cls = self.bert(
            review_input_ids, attention_mask=review_attention_mask
        ).pooler_output

        # Combine [CLS] token outputs for all text features
        combined_cls = torch.cat(
            (
                title_cls,  # Should be (batch_size, hidden_size)
                author_cls,  # Should be (batch_size, hidden_size)
                product_cls,  # Should be (batch_size, hidden_size)
                review_cls,  # Should be (batch_size, hidden_size)
            ),
            dim=1,
        )

        # Make sure stars is 2D with shape (batch_size, 1)
        stars = stars.unsqueeze(1) if stars.dim() == 1 else stars

        # Apply dropout and pass through the affine transformation and activation
        combined_features = self.dropout(
            self.feature_combiner_activation(self.feature_combiner(combined_cls))
        )

        # Combine with the additional feature (e.g., stars)
        combined_with_additional_feature = torch.cat((combined_features, stars), dim=1)

        # Pass through the second hidden layer
        classifier_hidden_output = self.dropout(
            self.classifier_hidden_activation(
                self.classifier_hidden(combined_with_additional_feature)
            )
        )

        # Final classifier to get logits
        logits = self.classifier(classifier_hidden_output)

        return ModelOutput(logits=logits)

### Load the dataset


In [19]:
# Load the dataset
amz_combined_file_path = "../datasets/amazon_combined.parquet"
amz_combined_df = pd.read_parquet(amz_combined_file_path)

# Display the first few rows of the dataset for a quick overview
amz_combined_df.head()

Unnamed: 0,stars,title,date,author,product,review,is_sarcastic
0,1.0,"Listening to this ""Hurt"" me!","November 8, 2007","MomKKC ""momkkc""",The Sun Also Rises (Audio CD),William Hurt cannot read. At all. The cadenc...,1
1,1.0,"40% price hike, hmm","April 15, 2010",M. Barnhart,"Heineken BT06 BeerTender Tubes, Pack of 6 (Kit...","As another reviewer noted, these used to be 10...",1
2,5.0,Don't Mess With the Lupine Trinity!!!,"June 2, 2010",Jake &#34;The Wolfman&#34; Sanchez,The Mountain Three Wolf Moon Short Sleeve Tee ...,I've read several reviews from people who have...,1
3,1.0,IT'S A BLENDER!,"June 17, 2010",S. Cashdollar,Margaritaville DM1000 Frozen Concoction Maker ...,If you pay $250 for this blender you need your...,1
4,1.0,Another movie to ignore....,"April 24, 2010","Kody ""ParisHiltonFan""",Valentine's Day (DVD),A perfect date movie: you'll miss absolutely n...,1


### Clean the dataset


In [20]:
# Data cleaning: removing special characters and escape sequences from the sentences
amz_combined_df["review"] = amz_combined_df["review"].apply(
    lambda x: re.sub(r"[\n\r\t]+", " ", x)
)
amz_combined_df["product"] = amz_combined_df["product"].apply(
    lambda x: re.sub(r"[\n\r\t]+", " ", x)
)
amz_combined_df["author"] = amz_combined_df["author"].apply(
    lambda x: re.sub(r"[\n\r\t]+", " ", x)
)
amz_combined_df["title"] = amz_combined_df["title"].apply(
    lambda x: re.sub(r"[\n\r\t]+", " ", x)
)

# Checking for any null values in the dataset
amz_combined_null_check = amz_combined_df.isnull().sum()

# Checking the distribution of the 'is_sarcastic' column
amz_combined_label_distribution = amz_combined_df["is_sarcastic"].value_counts(
    normalize=True
)

amz_combined_null_check, amz_combined_label_distribution

(stars           0
 title           0
 date            0
 author          0
 product         0
 review          0
 is_sarcastic    0
 dtype: int64,
 is_sarcastic
 0    0.651515
 1    0.348485
 Name: proportion, dtype: float64)

### Split the dataset


In [21]:
# Splitting the dataset into training, validation, and testing sets
amz_combined_train_data, amz_combined_test_data = train_test_split(
    amz_combined_df,
    test_size=0.3,
    random_state=42,
    stratify=amz_combined_df["is_sarcastic"],  # Use the labels for stratification
)
amz_combined_val_data, amz_combined_test_data = train_test_split(
    amz_combined_test_data,
    test_size=0.5,
    random_state=42,
    stratify=amz_combined_test_data[
        "is_sarcastic"
    ],  # Use the labels for stratification
)

# Showing the size of each split
amz_combined_train_size, amz_combined_val_size, amz_combined_test_size = (
    len(amz_combined_train_data),
    len(amz_combined_val_data),
    len(amz_combined_test_data),
)
amz_combined_train_size, amz_combined_val_size, amz_combined_test_size

(877, 188, 189)

### Instanciate the dataset class & data loaders


In [22]:
from torch.utils.data import WeightedRandomSampler

In [23]:
# Constants for DataLoader
AMZ_COMBINED_MAX_LEN = 128 * 3
AMZ_COMBINED_BATCH_SIZE = 16

# Compute class weights inverse proportional to class frequencies
class_sample_counts = amz_combined_train_data["is_sarcastic"].value_counts()
class_weights = 1.0 / class_sample_counts
weights = amz_combined_train_data["is_sarcastic"].map(class_weights)
weights = weights.to_numpy()

sampler = WeightedRandomSampler(weights, len(weights), replacement=True)

# Instantiate the custom Dataset for Amazon product reviews
amz_combined_train_dataset = SarcasticProductReviewDataset(
    data=amz_combined_train_data,
    tokenizer=bert_base_uncased_tokenizer,
    max_len=AMZ_COMBINED_MAX_LEN,
)

amz_combined_val_dataset = SarcasticProductReviewDataset(
    data=amz_combined_val_data,
    tokenizer=bert_base_uncased_tokenizer,
    max_len=AMZ_COMBINED_MAX_LEN,
)

amz_combined_test_dataset = SarcasticProductReviewDataset(
    data=amz_combined_test_data,
    tokenizer=bert_base_uncased_tokenizer,
    max_len=AMZ_COMBINED_MAX_LEN,
)

# Create DataLoader instances for training, validation, and testing
amz_combined_train_loader = DataLoader(
    dataset=amz_combined_train_dataset,
    batch_size=AMZ_COMBINED_BATCH_SIZE,
    sampler=sampler,  # Use the WeightedRandomSampler instead of shuffle
)

amz_combined_val_loader = DataLoader(
    dataset=amz_combined_val_dataset, batch_size=AMZ_COMBINED_BATCH_SIZE
)

amz_combined_test_loader = DataLoader(
    dataset=amz_combined_test_dataset, batch_size=AMZ_COMBINED_BATCH_SIZE
)

### Instanciate the model & then necessary objects


In [24]:
# Hyperparameters
AMZ_COMBINED_PRETRAINED_MODEL_NAME_OR_PATH = COMBINED_PRETRAINED_MODEL_NAME_OR_PATH
AMZ_COMBINED_NUM_LABELS = COMBINED_NUM_LABELS
AMZ_COMBINED_HIDDEN_DROPOUT_PROB = 0.45
AMZ_COMBINED_ATTENTION_PROBS_DROPOUT_PROB = COMBINED_ATTENTION_PROBS_DROPOUT_PROB
AMZ_COMBINED_NUM_EPOCHS = 3
AMZ_COMBINED_LR = COMBINED_LR
AMZ_COMBINED_WEIGHT_DECAY = 0.05

# Specific hyperparameters for the amz model
AMZ_COMBINED_HIDDEN_SIZE = 768  # Default hidden size for BERT base
AMZ_COMBINED_STAR_HIDDEN_DROPOUT_PROB = 0.3  # Dropout rate for the star rating feature

# Path to the fine-tuned BERT model
FINE_TUNED_BERT_PATH = "sarcastic_model.pth"

# Instantiate the extended model
amz_combined_model = ExtendedBertForMultiFeatureClassification(
    pretrained_bert_path=AMZ_COMBINED_PRETRAINED_MODEL_NAME_OR_PATH,
    fine_tuned_bert_path=FINE_TUNED_BERT_PATH,
    hidden_size=AMZ_COMBINED_HIDDEN_SIZE,
    num_labels=AMZ_COMBINED_NUM_LABELS,
    hidden_dropout_prob=AMZ_COMBINED_HIDDEN_DROPOUT_PROB,
    attention_probs_dropout_prob=AMZ_COMBINED_ATTENTION_PROBS_DROPOUT_PROB,
    classifier_dropout_prob=AMZ_COMBINED_STAR_HIDDEN_DROPOUT_PROB,
)

# Send the model to GPU if available
amz_combined_model.to(device)  # type: ignore

# Optimizer
amz_combined_optimizer = AdamW(
    amz_combined_model.parameters(),
    lr=AMZ_COMBINED_LR,
    weight_decay=AMZ_COMBINED_WEIGHT_DECAY,
)

# Total number of training steps
amz_combined_total_steps = len(amz_combined_train_loader) * AMZ_COMBINED_NUM_EPOCHS

# Scheduler for learning rate
amz_combined_scheduler = get_linear_schedule_with_warmup(
    amz_combined_optimizer,
    num_warmup_steps=COMBINED_NUM_WARMUP_STEPS,
    num_training_steps=amz_combined_total_steps,
)

# Loss function
amz_combined_loss_fn = CrossEntropyLoss()

amz_combined_feature_keys = [
    "title_input_ids",
    "title_attention_mask",
    "author_input_ids",
    "author_attention_mask",
    "product_input_ids",
    "product_attention_mask",
    "review_input_ids",
    "review_attention_mask",
    "stars",
]



### Actual training and evaluation


In [25]:
# Train and evaluate the model
train_and_evaluate(
    model=amz_combined_model,
    train_loader=amz_combined_train_loader,
    val_loader=amz_combined_val_loader,
    optimizer=amz_combined_optimizer,
    scheduler=amz_combined_scheduler,
    loss_fn=amz_combined_loss_fn,
    device=device,
    num_epochs=AMZ_COMBINED_NUM_EPOCHS,
    train_dataset_len=len(amz_combined_train_dataset),
    val_dataset_len=len(amz_combined_val_dataset),
    feature_keys=amz_combined_feature_keys,
)

# Save the model
AMZ_FINE_TUNED_BERT_PATH = "amazon_sarcastic_model.pth"
torch.save(amz_combined_model.state_dict(), AMZ_FINE_TUNED_BERT_PATH)

Epoch 1/3
----------


  0%|          | 0/55 [00:00<?, ?it/s]

100%|██████████| 55/55 [02:59<00:00,  3.27s/it]


Training Metrics: accuracy: 0.6328 | precision_sarcasm: 0.6235 | recall_sarcasm: 0.6206 | precision_non_sarcasm: 0.9993 | recall_non_sarcasm: 0.9993 | loss: 0.6717


100%|██████████| 12/12 [00:12<00:00,  1.03s/it]


Validation Metrics: accuracy: 0.8085 | precision_sarcasm: 0.8718 | recall_sarcasm: 0.5231 | precision_non_sarcasm: 0.9994 | recall_non_sarcasm: 0.9999 | loss: 0.6268
Saved Best Model!

Epoch 2/3
----------


100%|██████████| 55/55 [02:50<00:00,  3.11s/it]


Training Metrics: accuracy: 0.6796 | precision_sarcasm: 0.6928 | recall_sarcasm: 0.6943 | precision_non_sarcasm: 0.9994 | recall_non_sarcasm: 0.9994 | loss: 0.6533


100%|██████████| 12/12 [00:10<00:00,  1.10it/s]


Validation Metrics: accuracy: 0.8298 | precision_sarcasm: 0.8235 | recall_sarcasm: 0.6462 | precision_non_sarcasm: 0.9995 | recall_non_sarcasm: 0.9998 | loss: 0.6165
Saved Best Model!

Epoch 3/3
----------


100%|██████████| 55/55 [02:49<00:00,  3.09s/it]


Training Metrics: accuracy: 0.7571 | precision_sarcasm: 0.7783 | recall_sarcasm: 0.7423 | precision_non_sarcasm: 0.9995 | recall_non_sarcasm: 0.9996 | loss: 0.6303


100%|██████████| 12/12 [00:11<00:00,  1.08it/s]


Validation Metrics: accuracy: 0.8245 | precision_sarcasm: 0.8077 | recall_sarcasm: 0.6462 | precision_non_sarcasm: 0.9995 | recall_non_sarcasm: 0.9998 | loss: 0.6131

Best Validation Accuracy: 0.8298 on Epoch 2
