<a href="https://colab.research.google.com/github/numustafa/GenAI/blob/main/AI_and_ML_fundamentals/sarcasm_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Intro
This is a mini project (training a classifier and making inference) based on the BERT model briefly introduced in the AI/ML foundations of the Gen AI course. It is a text classification project that uses data from Kaggle and the PyTorch framework.




# 2. Necessary Libs & Data
We will be using data from [Kaggle](https://rishabhmisra.github.io/publications/). It's a JSON file with one record per line, where each record consists of:
* Headline (`headline`)
* Sarcastic or not (`is_sarcastic`, 1 or 0)
* Link to the article (`article_link`)
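
As a rough illustration of that record layout, here is a minimal sketch that parses two records with the standard library. The field names match the dataset columns used later in this notebook; the headlines and links are made up for the example.

```python
import json

# Two hypothetical records in JSON Lines layout (one JSON object per line).
# Field names follow the dataset; the values are invented for illustration.
raw = "\n".join([
    '{"article_link": "https://example.com/a", "headline": "local man wins award", "is_sarcastic": 0}',
    '{"article_link": "https://example.com/b", "headline": "area cat elected mayor", "is_sarcastic": 1}',
])

# Parse each line independently, as pd.read_json(..., lines=True) does.
records = [json.loads(line) for line in raw.splitlines()]

# Keep only the two fields the classifier needs.
headlines = [r["headline"] for r in records]
labels = [r["is_sarcastic"] for r in records]
print(headlines, labels)
```

In the notebook itself the same parsing is done by `pd.read_json(..., lines=True)`, and the `article_link` column is dropped.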

In [1]:
!pip install transformers --quiet
!pip install opendatasets --quiet

import opendatasets as od
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.data import random_split
from torch.optim import Adam
from sklearn.metrics import accuracy_score, classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel
from tqdm import tqdm

od.download('https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection')


Dataset URL: https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection
Downloading news-headlines-dataset-for-sarcasm-detection.zip to ./news-headlines-dataset-for-sarcasm-detection


100%|██████████| 3.30M/3.30M [00:00<00:00, 103MB/s]







In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [3]:
data_df = pd.read_json('/content/news-headlines-dataset-for-sarcasm-detection/Sarcasm_Headlines_Dataset.json', lines=True)
data_df.dropna(inplace = True)
data_df.drop_duplicates(inplace = True)
data_df.drop(['article_link'], axis = 1, inplace = True)
data_df.head()

| | headline | is_sarcastic |
|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | mom starting to fear son's web series closest ... | 1 |
| 3 | boehner just wants wife to listen, not come up... | 1 |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 |


In [4]:
print(data_df.shape)

(26708, 2)


# 3. Model
We use the [BERT](https://huggingface.co/google-bert/bert-base-uncased?library=transformers) model for testing and evaluation. Loaded through `AutoModel`, it does not have a classification head, which means we need to add one and train it in the tuning section.

Since we are fine-tuning for sarcasm detection, we will be using `AutoModel` only, because we need the raw embeddings as input for our next task.



In [5]:
# Load model directly
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased") # this is a pre-trained model. we need to configure the layers customizable to our problem.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

# 4. Train Test Split
* Use a PyTorch `Dataset` to wrap our pandas DataFrame.
* Split the dataset into train, validation, and test sets.
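
The split sizes can be sketched with plain arithmetic. This assumes the 70% / 15% / remainder proportions used in the split cell of this notebook; `split_sizes` is a hypothetical helper name, not part of PyTorch.

```python
def split_sizes(n, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) sizes that always sum to n."""
    train = int(train_frac * n)   # floor of 70%
    val = int(val_frac * n)       # floor of 15%
    test = n - train - val        # whatever is left, so nothing is dropped
    return train, val, test

# 26708 is the number of rows reported by data_df.shape above.
print(split_sizes(26708))
```

Giving the remainder to the test set guarantees that the three sizes add up to the dataset length, which `random_split` requires when it is passed integer lengths.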



In [6]:
# Custom PyTorch Dataset for Sarcasm Detection
class SarcasmDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length = 100):
        self.texts = texts.tolist()  # Convert to a list for easy indexing
        self.labels = torch.tensor(labels.values, dtype=torch.long)  # Convert labels to tensor
        self.tokenizer = tokenizer    # Store tokenizer
        self.max_length = max_length  # Max token length for BERT

    def __len__(self):
        return len(self.labels)        # Return dataset size

    def __getitem__(self, idx):
        text = str(self.texts[idx])    # Convert text to string
        label = self.labels[idx]       # Get label for the index

        # Tokenize text using BERT
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,      # add [CLS] and [SEP] tokens
            truncation=True,              # truncate long texts
            padding='max_length',         # pad texts to max length
            max_length=self.max_length,   # max token length
            return_tensors='pt'           # return pytorch tensors
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),        # Token ids
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': label
        }

# Define features (headlines) and labels (is_sarcastic)
X = data_df['headline']
y = data_df['is_sarcastic']

# Create PyTorch Dataset
dataset = SarcasmDataset(X, y, tokenizer)
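
To picture what `padding='max_length'` and the attention mask do, here is a simplified sketch with plain lists. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation: the real tokenizer also keeps the final `[SEP]` token when it truncates, which this sketch does not. The token ids are the ones printed for the first headline in this notebook, and `pad_id=0` is BERT's `[PAD]` id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to max_length and build the matching mask."""
    ids = token_ids[:max_length]      # naive truncation (simplification)
    mask = [1] * len(ids)             # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

# Token ids of the first headline, starting with [CLS]=101 and ending with [SEP]=102.
token_ids = [101, 2280, 18601, 3401, 3573, 7805, 9790, 2015, 2058, 3595,
             1005, 2304, 3642, 1005, 2005, 7162, 4497, 7347, 102]

input_ids, attention_mask = pad_or_truncate(token_ids, max_length=100)
print(len(input_ids), sum(attention_mask))
```

The mask tells BERT which positions to attend to, so the padding zeros do not influence the headline's representation.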

In [7]:
# Verify a sample
sample = dataset[0]
print(sample)

{'input_ids': tensor([  101,  2280, 18601,  3401,  3573,  7805,  9790,  2015,  2058,  3595,
         1005,  2304,  3642,  1005,  2005,  7162,  4497,  7347,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 

In [8]:
# Define split sizes
train_size = int(0.7 * len(dataset))  # 70% training
val_size = int(0.15 * len(dataset))    # 15% validation
test_size = len(dataset) - train_size - val_size  # Remaining ~15% for testing

# Ensure reproducibility
torch.manual_seed(42)

# Split dataset
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

# Print sizes
print(f"Train: {len(train_dataset)}, Validation: {len(val_dataset)}, Test: {len(test_dataset)}")


Train: 18695, Validation: 4006, Test: 4007


# 5. Hyperparameters for the Model

In [9]:
# Create DataLoaders
batch_size = 32
epochs = 10
lr = 1e-4

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Check a batch
for batch in train_loader:
    print(batch["input_ids"].shape)  # Should be (batch_size, max_length)
    print(batch["attention_mask"].shape)  # Should be (batch_size, max_length)
    print(batch["label"].shape)  # Should be (batch_size,)
    break

torch.Size([32, 100])
torch.Size([32, 100])
torch.Size([32])


# 6. Model Building
We add a custom classification head on top of BERT for training purposes.
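
As a rough sketch of what such a head computes, the following pure-Python example runs a single vector through Linear, ReLU, Linear, and sigmoid. The dimensions (3 to 2 to 1) and all weights are made up for illustration; the model in this notebook uses 768 to 384 to 1 with learned weights, and dropout is omitted here because it acts as the identity at inference time.

```python
import math

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical stand-in for the [CLS] embedding (3 values instead of 768).
cls_vector = [0.5, -1.0, 2.0]
w1, b1 = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]], [0.0, 0.1]  # first layer (3 -> 2)
w2, b2 = [[1.0, -1.0]], [0.0]                              # second layer (2 -> 1)

hidden = relu(linear(cls_vector, w1, b1))  # negative activations are zeroed
logit = linear(hidden, w2, b2)[0]          # single raw score
prob = sigmoid(logit)                      # probability of the "sarcastic" class
print(hidden, logit, prob)
```

The final sigmoid squashes the raw score into the range (0, 1), so the output can be read as the probability that the headline is sarcastic.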



In [10]:
class MyModel(nn.Module):
    def __init__(self, bert):
      super(MyModel, self).__init__()
      self.bert = bert
      self.dropout = nn.Dropout(0.25)
      self.fc = nn.Linear(768, 384)
      self.relu = nn.ReLU()
      self.fc2 = nn.Linear(384, 1)
      self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, attention_mask):
      # [0] is the last hidden state; [:, 0] selects the [CLS] token embedding of each sequence
      pooled_outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)[0][:,0]
      output = self.fc(pooled_outputs)
      output = self.dropout(output)
      output = self.relu(output)
      output = self.fc2(output)
      output = self.sigmoid(output)
      return output


In [11]:
# freeze all parameters - feature extraction: model trained only on layers fc and fc2
for param in model.parameters():
  param.requires_grad = False
model = MyModel(model)
model.to(device)


MyModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affin

In [12]:
# training parameters to optimize
criterion = nn.BCELoss()
optimizer = Adam(model.parameters(), lr=lr)
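
Because `MyModel.forward` already ends in a sigmoid, `nn.BCELoss` receives probabilities rather than logits. The pure-Python sketch below (with made-up probabilities and targets) computes binary cross-entropy by hand and shows why the 0.5 threshold should be applied to the model output directly: a second sigmoid maps every value in (0, 1) to something above 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(probs, targets):
    """Mean binary cross-entropy over probabilities in (0, 1)."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

# Hypothetical model outputs (already probabilities) and true labels.
probs = [0.9, 0.2, 0.6, 0.1]
targets = [1, 0, 1, 0]

loss = bce(probs, targets)
direct = [int(p > 0.5) for p in probs]                   # threshold the probability
double_sigmoid = [int(sigmoid(p) > 0.5) for p in probs]  # sigmoid applied twice

print(loss, direct, double_sigmoid)
```

With the double sigmoid every example is predicted as class 1, so the reported accuracy would just reflect the share of sarcastic headlines in the batch.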

In [None]:

from tqdm import tqdm  # Progress bar

# Lists for tracking loss and accuracy
total_loss_train_plot = []
total_loss_val_plot = []
total_acc_train_plot = []
total_acc_val_plot = []

# Training loop
for epoch in range(epochs):
    # Track total loss and accuracy
    total_acc_train = 0
    total_loss_train = 0
    total_acc_val = 0
    total_loss_val = 0

    # Training Phase
    model.train()  # Set the model to training mode
    for data in tqdm(train_loader, desc=f"Epoch {epoch+1} [Training]"):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        labels = data['label'].float().to(device)  # BCELoss expects float targets

        # Forward pass (squeeze only the last dim so a batch of size 1 keeps its shape)
        prediction = model(input_ids, attention_mask).squeeze(-1)

        # Compute loss
        batch_loss = criterion(prediction, labels)
        total_loss_train += batch_loss.item()

        # The model output is already a probability (sigmoid is applied in forward),
        # so threshold it directly for the accuracy calculation
        pred_labels = (prediction > 0.5).float()
        acc = (pred_labels == labels).float().mean().item()
        total_acc_train += acc

        # Backpropagation
        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()

    # Validation Phase (No Gradients Needed)
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        for data in tqdm(val_loader, desc=f"Epoch {epoch+1} [Validation]"):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['label'].float().to(device)

            # Forward pass
            prediction = model(input_ids, attention_mask).squeeze(-1)

            # Compute loss
            batch_loss = criterion(prediction, labels)
            total_loss_val += batch_loss.item()

            # Threshold the probability directly (no second sigmoid)
            pred_labels = (prediction > 0.5).float()
            acc = (pred_labels == labels).float().mean().item()
            total_acc_val += acc

    # Compute average loss and accuracy
    avg_train_loss = total_loss_train / len(train_loader)
    avg_train_acc = total_acc_train / len(train_loader)
    avg_val_loss = total_loss_val / len(val_loader)
    avg_val_acc = total_acc_val / len(val_loader)

    # Append for plotting
    total_loss_train_plot.append(avg_train_loss)
    total_acc_train_plot.append(avg_train_acc)
    total_loss_val_plot.append(avg_val_loss)
    total_acc_val_plot.append(avg_val_acc)

    # Print Progress
    print(f"\nEpoch {epoch+1} Summary:")
    print(f"Train Loss: {avg_train_loss:.4f} | Train Accuracy: {avg_train_acc:.4f}")
    print(f"Validation Loss: {avg_val_loss:.4f} | Validation Accuracy: {avg_val_acc:.4f}\n")


Epoch 1 [Training]:   1%|          | 5/585 [01:00<1:56:43, 12.08s/it]
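
One detail of the loop above: the epoch accuracy is the mean of per-batch accuracies, so a short final batch counts as much as a full one. With 585 training batches the effect is likely small, but the sketch below, using made-up batch sizes and counts, shows how the two ways of averaging can differ.

```python
# Hypothetical batch sizes and numbers of correct predictions, for illustration only.
batch_sizes = [4, 4, 1]
batch_correct = [3, 2, 1]

# What the training loop computes: average of per-batch accuracies.
mean_of_batch_acc = sum(c / n for c, n in zip(batch_correct, batch_sizes)) / len(batch_sizes)

# Alternative: total correct over total samples, which weights every sample equally.
sample_weighted_acc = sum(batch_correct) / sum(batch_sizes)

print(mean_of_batch_acc, sample_weighted_acc)
```

If exact per-sample accuracy matters, one option is to accumulate the number of correct predictions and the number of samples, then divide once at the end of the epoch.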

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))

ax1.plot(total_loss_train_plot, label='Train')
ax1.plot(total_loss_val_plot, label='Validation')
ax1.set_title('Loss')
ax1.legend()

ax2.plot(total_acc_train_plot, label='Train')
ax2.plot(total_acc_val_plot, label='Validation')
ax2.set_title('Accuracy')
ax2.legend()

plt.show()