<a href="https://colab.research.google.com/github/mduffy23/Sarcasm-Detection-AIML-Final-Project/blob/main/Sarcasm_Detection_Final_Project_AIML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project - Sarcasm Detection

Many people have read a text message and been unsure whether it was meant sarcastically (or, on the other side, have struggled to make sure the recipient knows they are being sarcastic). Without the vocal inflection that usually accompanies a sarcastic remark, sarcasm in text is awkward and difficult to interpret.

This project aims to build a model that predicts whether a statement is sarcastic based purely on its text. Capitalization, punctuation, and word choice can all help indicate whether the writer meant the statement sarcastically.

- The baseline model is TF-IDF with logistic regression.
- The final model is a fine-tuned RoBERTa model.

## Imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')
project_path = '/content/drive/MyDrive/Final AIML Project - Sarcasm Detection'

In [None]:
#!pip install -q transformers datasets accelerate sentencepiece
#!pip install -q torch --index-url https://download.pytorch.org/whl/cu121
#!pip install optuna

In [None]:
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from scipy.sparse import hstack
from tqdm.notebook import tqdm

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
)

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

import torch
import torch.nn as nn
from datasets import Dataset, ClassLabel, Features
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    RobertaTokenizerFast,
    RobertaModel,
    TrainingArguments,
    Trainer,
    default_data_collator
)
from transformers.modeling_outputs import SequenceClassifierOutput
import optuna

## Dataset

The dataset comes from https://www.kaggle.com/datasets/danofer/sarcasm. It was downloaded via Kaggle Hub on my local machine and moved into Google Drive for easy access. It contains 1,010,826 Reddit comments, balanced between sarcastic and non-sarcastic. The sarcastic examples were collected by scraping comments that contain a sarcasm tag.

I initially planned to use a headline sarcasm dataset, but it was insufficient because its *sarcastic* examples did not reflect the sarcasm people use in conversation. The idea behind this project is that one could receive a text out of nowhere (or in the middle of a chat) and get a reasonable prediction of whether it is sarcastic. Reddit was a far better source for capturing the sarcasm that people actually use.

In [None]:
sarcasm = pd.read_csv(project_path + '/train-balanced-sarcasm.csv')
print(sarcasm.info())
display(sarcasm.head())
# Name the columns explicitly so the result is the same across pandas versions
display(sarcasm['label'].value_counts().rename_axis('label').reset_index(name='count'))

In [None]:
display(sarcasm.isna().sum())
sarcasm.dropna(inplace=True)

Must drop any null comments.

In [None]:
sarcasm['subreddit'].value_counts().head(20)

There are many different *subreddits* that are specific to particular topics. The most common is a very generic subreddit called "AskReddit" which is followed by more specific domains like politics and worldnews.

One important consideration for this project is the lack of context the model gets. The "AskReddit" subreddit poses questions that are answered by people. Sometimes the answers are sarcastic, but one would only know that based on the question at hand. Perhaps putting in a feature for parent comment would help with the association of more contextual sarcasm, but I want to focus more on catching sarcasm that comes out of the blue. Not using any parent comment for context should help the model understand the essence of sarcasm, rather than how to respond to someone sarcastically. Maybe a sarcasm generation model would be better suited to use a parent comment for a training feature.

Take a stratified 75% sample of the data, then split it into X and y.

In [None]:
sample_sarcasm, _ = train_test_split(sarcasm, train_size=0.75, random_state=102, stratify=sarcasm['label'])
sample_sarcasm.reset_index(drop=True, inplace=True)
sample_sarcasm.shape

In [None]:
X = sample_sarcasm['comment']
y = sample_sarcasm['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=102, stratify=y)
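
To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements `train_test_split`; the helper name `stratified_split` and the toy labels are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=102):
    """Return (train_idx, test_idx) with each class split at the same rate.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy balanced labels: 50 sarcastic (1) and 50 non-sarcastic (0) comments.
toy_labels = [0, 1] * 50
train_idx, test_idx = stratified_split(toy_labels)

# Both splits keep the 50/50 class balance of the full data.
train_positive_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_positive_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_positive_rate, test_positive_rate)
```

Because each class is split separately, the test set keeps the same share of sarcastic comments as the training set, which is what `stratify=y` asks for above.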

## Text Processing

### Sarcasm Indications

I want the model to learn sarcasm from training on the Reddit comments, but I also want to help it along by identifying common indicators of sarcasm. I created a function that adds lexical sarcasm features to any string of text. These features are used as secondary inputs in both the baseline and final models.

In [None]:
# Note the comma between 'just' and 'sooo': without it Python joins them into 'justsooo'
intensifiers = {'literally', 'totally', 'completely', 'absolutely', 'seriously', 'really', 'just', 'sooo', 'wooow'}
irony = {'yeah right', 'oh great', 'just perfect', 'of course', 'sure thing', 'oh yeah', 'oh,'}
sarcasm_punctuation = {'!', '!!', '!?', '?!'}

In [None]:
def count_elongated(token):
    return 1 if re.search(r"(.)\1\1+", token) else 0

def lexical_features_one(text):
    text_lower = text.lower()
    tokens = re.findall(r'\w+|\S', text)

    intensifier_count = sum(t.lower() in intensifiers for t in tokens)

    irony_present = 1 if any(phrase in text_lower for phrase in irony) else 0

    punctuation_count = sum(text.count(p) for p in sarcasm_punctuation)

    cap_tokens = sum(1 for t in tokens if len(t) > 2 and t.isupper())
    cap_ratio = cap_tokens / max(len(tokens), 1)

    elongated_count = sum(count_elongated(t) for t in tokens)

    return [
        intensifier_count,
        irony_present,
        punctuation_count,
        cap_ratio,
        elongated_count
    ]

def build_lexical_matrix(text_list):
    return np.vstack([lexical_features_one(t) for t in text_list])
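
As a worked example of the five features, the sketch below re-implements the same logic in a self-contained form and applies it to one made-up comment. The word lists are shortened, illustrative stand-ins for the sets above, and the name `sketch_lexical_features` is hypothetical.

```python
import re

# Illustrative word lists, assumed for this sketch (shorter than the sets above).
INTENSIFIERS = {"literally", "totally", "really", "just"}
IRONY_PHRASES = {"yeah right", "oh great"}
MARKS = {"!", "!!", "!?", "?!"}

def sketch_lexical_features(text):
    """Return [intensifiers, irony flag, punctuation hits, caps ratio, elongated]."""
    tokens = re.findall(r"\w+|\S", text)
    intensifier_count = sum(t.lower() in INTENSIFIERS for t in tokens)
    irony_present = int(any(p in text.lower() for p in IRONY_PHRASES))
    # str.count runs once per pattern, so "!!" scores two hits for "!" plus one for "!!".
    punctuation_count = sum(text.count(m) for m in MARKS)
    cap_tokens = sum(1 for t in tokens if len(t) > 2 and t.isupper())
    cap_ratio = cap_tokens / max(len(tokens), 1)
    elongated_count = sum(1 for t in tokens if re.search(r"(.)\1\1+", t))
    return [intensifier_count, irony_present, punctuation_count, cap_ratio, elongated_count]

example = "Oh great, I just LOOOVE Mondays!!"
features = sketch_lexical_features(example)
print(features)
```

For this comment the sketch finds one intensifier ("just"), one irony phrase ("oh great"), three punctuation hits, one all-caps token out of nine tokens, and one elongated token ("LOOOVE"). The punctuation count overlaps by design of the counting rule, so repeated marks weigh more heavily than single ones.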

### Lemmatizer

This is for TF-IDF, not RoBERTa. A transformer model like RoBERTa...

In [None]:
# Lemmatize functions

## For Modelling
def sarcasm_lemma_tokenizer(text):
  lemmatizer = WordNetLemmatizer()
  text = text.lower()
  tokens = word_tokenize(text)
  lemmas = [lemmatizer.lemmatize(tok) if tok.isalpha() else tok for tok in tokens]
  return lemmas

## For visuals
def sarcasm_lemma_vectorizer_no_punc(text):
  tokens = sarcasm_lemma_tokenizer(text)
  return [tok.lower() for tok in tokens if tok.isalpha() and tok not in sarcasm_punctuation]


## Data Exploration

In [None]:
def plot_top_grams(df, label_val, n_gram_range = (1, 2), top=20):
  df_label = df[df['label'] == label_val]['comment']
  vectorizer = CountVectorizer(tokenizer=sarcasm_lemma_vectorizer_no_punc, token_pattern=None, lowercase=True, ngram_range=n_gram_range, stop_words='english')
  X = vectorizer.fit_transform(df_label)

  counts = np.asarray(X.sum(axis=0)).flatten()
  vocab = vectorizer.get_feature_names_out()

  freq_df = pd.DataFrame({'ngram': vocab, 'count': counts})
  freq_df = freq_df.sort_values(by='count', ascending=False).head(top)

  plt.figure(figsize=(12, 8))
  bars = plt.barh(freq_df['ngram'], freq_df['count'], color='blue', alpha = 0.7)
  plt.gca().invert_yaxis() # Show highest frequency on top
  plt.xticks(fontsize=12)
  plt.yticks(fontsize=12)
  plt.xlabel('Count', fontsize=14)
  plt.ylabel('N-gram', fontsize=18)
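  # Build the label text first: reusing the same quote type inside an f-string fails before Python 3.12
  label_name = 'Sarcastic' if label_val == 1 else 'Non-sarcastic'
  plt.title(f'Top {top} {n_gram_range}-grams for {label_name} Comments', fontsize=18)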

  # Add data labels
  for bar in bars:
      plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f' {int(bar.get_width())}', va='center')

  plt.tight_layout()
  sns.despine()
  plt.show()

plot_top_grams(sample_sarcasm, 1)
plot_top_grams(sample_sarcasm, 0)
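
The counting that `CountVectorizer` performs with `ngram_range=(1, 2)` can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation; `count_ngrams` and the toy comments are assumptions.

```python
from collections import Counter

def count_ngrams(token_lists, n_values=(1, 2)):
    """Count every unigram and bigram across a list of tokenized comments."""
    counts = Counter()
    for tokens in token_lists:
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Toy, already-tokenized comments (hypothetical data).
toy_comments = [["yeah", "right"], ["yeah", "sure"], ["yeah", "right", "buddy"]]
ngram_counts = count_ngrams(toy_comments)
print(ngram_counts.most_common(3))
```

The bar charts above show the same kind of totals, computed on lemmatized tokens with stop words removed.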

## Baseline TF-IDF

In [None]:
# Define a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=102, stratify=y)

In [None]:
# TF-IDF vectorization
vectorizer = TfidfVectorizer(tokenizer=sarcasm_lemma_tokenizer, ngram_range=(1, 2), min_df=2, preprocessor=None, token_pattern=None)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
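
For intuition about what the vectorizer produces, here is a small hand-rolled TF-IDF sketch using the smoothed IDF formula and L2 row normalization that scikit-learn applies by default. It skips `min_df`, bigrams, and lemmatization; `tfidf_rows` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_rows(token_lists):
    """TF-IDF with smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2-normalized rows."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in token_lists:
        raw = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard empty docs
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Toy documents (hypothetical data).
toy_docs = [
    "oh great another monday".split(),
    "great work".split(),
    "oh great".split(),
]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In the first toy document, "great" appears in every document and so receives the lowest weight, while "another" and "monday" appear only once in the corpus and receive the highest.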

In [None]:
# Add lexical features
X_train_lex = build_lexical_matrix(X_train)
X_test_lex = build_lexical_matrix(X_test)

# Add to dataset
X_train = hstack([X_train_vec, X_train_lex])
X_test = hstack([X_test_vec, X_test_lex])

In [None]:
# Linear classifier
clf = LogisticRegression(max_iter=500, class_weight = 'balanced')
clf.fit(X_train, y_train)

# Evaluation
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Simple logistic regression does a nice job of learning sarcasm in general (significantly better than guessing, which would score about 50% on this balanced dataset).

## BERT

In [None]:
dataset = Dataset.from_dict({"text": X, "label": y})
# ClassLabel names must be strings
features = Features({"text": dataset.features["text"], "label": ClassLabel(num_classes=2, names=["0", "1"])})
dataset = dataset.cast(features)
dataset = dataset.train_test_split(test_size=0.20, seed=102, stratify_by_column='label')

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

def add_lexical_features(batch):
    features = [lexical_features_one(text) for text in batch['text']]
    batch['lexical_features'] = features
    return batch

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.map(add_lexical_features, batched=True)
tokenized = tokenized.remove_columns(["text"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")

In [None]:
training_args = TrainingArguments(
    output_dir= project_path + "/test",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=0.001,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"]
)

trainer.train()

In [None]:
trainer.evaluate()

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

sentences = ['Great work, no one has ever thought of that.']

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
  outputs = model(**inputs)
  logits = outputs.logits

probs = torch.softmax(logits, dim=1)
preds = torch.argmax(probs, dim=1)

print(probs)
print(preds)
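
The last two steps of the cell above, softmax followed by argmax, can be reproduced by hand. The sketch below uses made-up logits and plain Python; it only illustrates the arithmetic and is not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [not sarcastic, sarcastic].
toy_logits = [-0.4, 1.2]
toy_probs = softmax(toy_logits)
toy_pred = max(range(len(toy_probs)), key=toy_probs.__getitem__)
print(toy_probs, toy_pred)
```

The predicted class is simply the index of the largest probability, so index 1 here corresponds to the sarcastic label.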

In [None]:
outputs = trainer.predict(tokenized["test"])
logits = outputs.predictions
labels = outputs.label_ids
preds = np.argmax(logits, axis=1)
print(classification_report(labels, preds))
print(confusion_matrix(labels, preds))

In [None]:
model.save_pretrained(project_path + '/withsarcasmref')

In [None]:
tokenizer.save_pretrained(project_path + '/withsarcasmref')

In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

In [None]:
def tokenize(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        return_tensors=None
    )

def add_lexical_features(batch):
    batch["lexical_features"] = [
        lexical_features_one(text) for text in batch["text"]
    ]
    return batch

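# Rebuild the dataset from X and y (the earlier `dataset` is already split), then cast labels
dataset = Dataset.from_dict({"text": list(X), "label": list(y)})
features = Features({
    "text": dataset.features["text"],
    "label": ClassLabel(num_classes=2),
})
dataset = dataset.cast(features)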

# train-test split
dataset = dataset.train_test_split(
    test_size=0.20,
    seed=102,
    stratify_by_column="label"
)

# Apply tokenizer + lexical features
tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.map(add_lexical_features, batched=True)

# remove unused columns
tokenized = tokenized.remove_columns(["text"])
tokenized = tokenized.rename_column("label", "labels")

# Convert everything to PyTorch tensors
tokenized.set_format("torch")

In [None]:
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaConfig

class RobertaWithFeatures(nn.Module):
    def __init__(self, feature_dim, num_labels=2):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        self.dropout = nn.Dropout(0.1)

        # The CLS embedding is 768-dimensional for roberta-base
        roberta_dim = self.roberta.config.hidden_size

        # Combine CLS + custom feature vector
        combined_dim = roberta_dim + feature_dim

        self.classifier = nn.Sequential(
            nn.Linear(combined_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_labels)
        )

    def forward(self, input_ids, attention_mask, features):
        # RoBERTa forward
        outputs = self.roberta(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        cls_emb = outputs.last_hidden_state[:, 0, :]  # CLS token

        # Concatenate CLS with custom lexical features
        combined = torch.cat([cls_emb, features], dim=1)
        logits = self.classifier(combined)
        return logits

model = RobertaWithFeatures(feature_dim=5)

class SarcasmDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, features, labels):
        self.encodings = encodings
        self.features = torch.tensor(features, dtype=torch.float)
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["features"] = self.features[idx]
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
from transformers import Trainer
import torch.nn as nn

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs["labels"]
        lexical_features = inputs["lexical_features"]

        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            features=lexical_features  # RobertaWithFeatures.forward names this argument `features`
        )

        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(outputs, labels)

        return (loss, outputs) if return_outputs else loss

In [None]:
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        features = inputs.pop("lexical_features")  # column name created by add_lexical_features
        outputs = model(**inputs, features=features)
        loss_fn = nn.CrossEntropyLoss()
        loss = loss_fn(outputs, labels)
        return (loss, outputs) if return_outputs else loss

In [None]:
training_args = TrainingArguments(
    output_dir= project_path + "/results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",
    logging_strategy="epoch",
    remove_unused_columns=False,  # keep the labels and lexical feature columns for the custom model
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()

## RoBERTa

In [None]:
# Utilize GPU well
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision("medium")

In [None]:
# Model definition with lexical features
class RobertaWithLexical(nn.Module):
    def __init__(self, model_name="roberta-base", feature_dim=5, num_labels=2, dropout=0.1):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)
        hidden_size = self.roberta.config.hidden_size

        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size + feature_dim, num_labels)

    def forward(self, input_ids=None, attention_mask=None, lexical_features=None, labels=None, **kwargs):
        # Roberta embeddings
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)

        cls_emb = outputs.last_hidden_state[:, 0, :]  # CLS token

        # Lexical features
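        if lexical_features is None:
            # Fall back to zeros with the width the classifier expects
            feature_dim = self.classifier.in_features - cls_emb.size(1)
            lexical_features = torch.zeros(cls_emb.size(0), feature_dim, device=cls_emb.device, dtype=cls_emb.dtype)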
        else:
            lexical_features = lexical_features.to(cls_emb.device)
            if lexical_features.dtype != cls_emb.dtype:
                lexical_features = lexical_features.to(dtype=cls_emb.dtype)

        # Concatenate
        combined = torch.cat([cls_emb, lexical_features], dim=1)
        pooled = self.dropout(combined)
        logits = self.classifier(pooled)

        # Compute loss if labels provided
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))

        # Return proper HF output
        return SequenceClassifierOutput(loss=loss, logits=logits)
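
In equation form, the forward pass above can be summarized as follows. The symbols are introduced here only for illustration: $\mathbf{h}_{\text{CLS}}$ is the 768-dimensional hidden state at the first token for `roberta-base`, $\mathbf{f}$ is the 5-dimensional lexical feature vector, and $y$ is the true label of one example.

```latex
\begin{aligned}
\mathbf{z} &= \operatorname{Dropout}\big([\mathbf{h}_{\text{CLS}};\ \mathbf{f}]\big) \in \mathbb{R}^{768 + 5}, \\
\text{logits} &= W\mathbf{z} + \mathbf{b}, \qquad W \in \mathbb{R}^{2 \times 773},\ \mathbf{b} \in \mathbb{R}^{2}, \\
\mathcal{L} &= -\log \frac{\exp(\text{logits}_{y})}{\sum_{k=0}^{1} \exp(\text{logits}_{k})}
\end{aligned}
```

The loss shown is for a single example; `nn.CrossEntropyLoss` averages it over the batch by default.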

In [None]:
# Lexical feature generation
intensifiers = {"literally", "absolutely", "totally", "completely", "seriously"}
irony = {"yeah right", "sure", "as if", "i bet", "no way", "oh great", "oh yeah"}
sarcasm_punctuation = {"!","!!","!?","?!"}

def count_elongated(token):
    return 1 if re.search(r"(.)\1\1+", token) else 0

def lexical_features_one(text):
    text = str(text)
    text_lower = text.lower()
    tokens = re.findall(r'\w+|\S', text)

    intensifier_count = sum(t.lower() in intensifiers for t in tokens)
    irony_present = 1 if any(phrase in text_lower for phrase in irony) else 0
    punctuation_count = sum(text.count(p) for p in sarcasm_punctuation)
    cap_tokens = sum(1 for t in tokens if len(t) > 2 and t.isupper())
    cap_ratio = cap_tokens / max(len(tokens), 1)
    elongated_count = sum(count_elongated(t) for t in tokens)

    return [intensifier_count, irony_present, punctuation_count, cap_ratio, elongated_count]

In [None]:
# Data set up
MODEL_NAME = "roberta-base"
tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_NAME)

X = sample_sarcasm['comment'].tolist()
y = sample_sarcasm['label'].tolist()

def add_lexical(batch):
    batch["lexical_features"] = [np.array(lexical_features_one(x), dtype=np.float32) for x in batch["text"]]
    return batch

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = Dataset.from_dict({"text": X, "labels": y})
dataset = dataset.train_test_split(test_size=0.2, seed=102)

dataset = dataset.map(add_lexical, batched=True)
dataset = dataset.map(tokenize, batched=True)
dataset = dataset.remove_columns(["text"])
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "lexical_features", "labels"])
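
The effect of `truncation=True, padding="max_length"` can be sketched without the tokenizer. This is a simplified illustration: a real tokenizer also keeps the closing special token when it truncates, and both the token ids and `pad_id=1` (assumed here to be the `roberta-base` padding id) are placeholders.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=1):
    """Clip or right-pad a list of token ids and build the matching attention mask.

    Hypothetical helper for illustration; not the tokenizer's implementation.
    """
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask

# A short sequence gets padded; a long one gets clipped.
short_ids, short_mask = pad_or_truncate([0, 31414, 232, 2], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(200)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what lets `default_data_collator` stack them into a batch without further padding.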

We use a single train/test split here because the training set is large. A training run (even on Colab Pro GPUs) can take 20-30 minutes for one model. Five-fold cross-validation would multiply that by 5, which is too expensive and time consuming for the resources available.

In [None]:
# Training helpers

# Freeze
def freeze_roberta(model, train_last_n=4):
    total = model.roberta.config.num_hidden_layers
    train_from = total - train_last_n

    for name, param in model.roberta.named_parameters():
        match = re.search(r"encoder\.layer\.(\d+)\.", name)
        if match:
            layer = int(match.group(1))
            if layer < train_from:
                param.requires_grad = False
        elif "embeddings" in name:
            param.requires_grad = False

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        lexical = inputs.pop("lexical_features")

        outputs = model(
            **inputs,
            lexical_features=lexical,
            labels=labels,
            **kwargs
        )

        loss = outputs.loss

        return (loss, outputs) if return_outputs else loss

def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "accuracy": accuracy_score(p.label_ids, preds),
        "f1": f1_score(p.label_ids, preds)
    }

#Good
# ==============================================================
# 6. TRAINING
# ==============================================================

model = RobertaWithLexical()
freeze_roberta(model, train_last_n=4)

# Try compile()
try:
    model = torch.compile(model)
    print("Compiled model for faster training.")
except Exception as e:
    print("torch.compile failed:", e)

#training_args = TrainingArguments(
#    output_dir="/content/drive/MyDrive/Final AIML Project - Sarcasm Detection/RobertaLarge",
#    per_device_train_batch_size=32,
#    per_device_eval_batch_size=64,
#    eval_strategy="epoch",
#    save_strategy="epoch",
#    remove_unused_columns=False,
#    fp16=True,
#    optim="adamw_torch_fused",
#    learning_rate=3e-5,
#    num_train_epochs=3,
#    dataloader_num_workers=4,
#    dataloader_pin_memory=True,
#    dataloader_persistent_workers=True,
#    report_to="none"
#)

#trainer = Trainer(
#    model=model,
#    args=training_args,
#    train_dataset=dataset["train"],
#    eval_dataset=dataset["test"],
#    tokenizer=tokenizer,
#    data_collator=default_data_collator,
#    compute_metrics=metrics,
#)


#trainer.train()
#trainer.save_model("/content/drive/MyDrive/Final AIML Project - Sarcasm Detection/RobertaLarge")
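
To see which parameters `freeze_roberta` leaves trainable, the name-matching rule can be applied to a few parameter names without loading the model. The names below are written by hand in the style `RobertaModel` uses, and `is_frozen` is a hypothetical helper that mirrors the rule rather than the function itself.

```python
import re

def is_frozen(param_name, num_layers=12, train_last_n=4):
    """Mirror the name-based rule: freeze embeddings and all but the last N layers."""
    match = re.search(r"encoder\.layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) < num_layers - train_last_n
    return "embeddings" in param_name

# Parameter names in the style used by RobertaModel (listed here by hand).
names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.7.attention.self.query.weight",
    "encoder.layer.8.attention.self.query.weight",
    "pooler.dense.weight",
]
frozen = {name: is_frozen(name) for name in names}
for name, flag in frozen.items():
    print(f"{name}: {'frozen' if flag else 'trainable'}")
```

With 12 encoder layers and `train_last_n=4`, layers 0 through 7 and the embeddings are frozen, while layers 8 through 11 stay trainable. Parameters that match neither pattern, such as the pooler, are left untouched by this rule.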

In [None]:
# Optuna Helpers
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 6e-5, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.2),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2)
    }

def model_init(trial=None):
    dropout = 0.1
    train_last_n = 4

    # If Optuna trial is active, override the defaults
    if trial is not None:
        dropout = trial.suggest_float("classifier_dropout", 0.05, 0.4)
        train_last_n = trial.suggest_int("train_last_n", 1, 6)

    model = RobertaWithLexical(dropout=dropout)

    freeze_roberta(model, train_last_n=train_last_n)

    return model
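
The `log=True` flag on the learning-rate range means values are drawn uniformly in log space, so each order of magnitude is explored equally. The sketch below imitates that idea with the standard library; it is not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high = 1e-6, 6e-5  # same bounds as the learning-rate range above
samples = [sample_log_uniform(low, high, rng) for _ in range(1000)]

# About half of the draws fall below the geometric mean of the bounds.
geometric_mean = math.sqrt(low * high)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(f"geometric mean: {geometric_mean:.2e}, share below: {share_below:.2f}")
```

A plain uniform draw over the same range would spend most of its samples near the upper bound, which is why a log scale is the usual choice for learning rates.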

In [None]:
# Hyperparameter tuning - Ran for over an hour
training_args = TrainingArguments(
    output_dir="optuna_roberta_fast",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=1,
    warmup_ratio=0.0,        # overwritten by optuna
    learning_rate=2e-5,      # overwritten by optuna
    weight_decay=0.0,        # overwritten
    eval_strategy="epoch",
    save_strategy="no",
    logging_strategy="no",
    fp16=True,
    optim="adamw_torch_fused",
    report_to="none",
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=metrics,
)

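# Note: trainer.hyperparameter_search creates its own Optuna study, so this
# study (and its pruner) is not used by the search below.
study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=0)
)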

best_trial = trainer.hyperparameter_search(
    n_trials=10,                     # fast
    hp_space=optuna_hp_space,
    backend="optuna",
    direction="maximize",
    compute_objective=lambda m: m["eval_f1"],   # F1 target
)

print(best_trial)

In [None]:
besthyperparameters={'learning_rate': 2.485002246616929e-05, 'weight_decay': 0.1802756018668522, 'warmup_ratio': 0.018819509939976123, 'classifier_dropout': 0.16749163342760462, 'train_last_n': 4}

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids

    # argmax over classes
    preds = np.argmax(logits, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )

    acc = accuracy_score(labels, preds)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
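
Note that the Optuna objective earlier uses scikit-learn's default binary F1 (the F1 of the sarcastic class), while `compute_metrics` reports the support-weighted average over both classes. The sketch below computes both by hand on made-up predictions to show how they can differ; the helper names are hypothetical.

```python
def f1_for_class(labels, preds, cls):
    """F1 of one class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(labels, preds):
    """Average the per-class F1 scores, weighting each class by its support."""
    classes = sorted(set(labels))
    return sum(labels.count(c) * f1_for_class(labels, preds, c) for c in classes) / len(labels)

# Toy labels and predictions (hypothetical data, deliberately imbalanced).
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_preds = [1, 0, 0, 0, 0, 0, 0, 1]
binary_f1 = f1_for_class(toy_labels, toy_preds, cls=1)
support_weighted_f1 = weighted_f1(toy_labels, toy_preds)
print(round(binary_f1, 3), round(support_weighted_f1, 3))
```

On a balanced test set like the one used here the two numbers tend to be closer than in this toy case, but they are not interchangeable when comparing runs.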

In [None]:
dataset

In [None]:
# Utilize best parameters
best = besthyperparameters # best_trial.hyperparameters
model = RobertaWithLexical(dropout=best["classifier_dropout"])
freeze_roberta(model, train_last_n=best["train_last_n"])

# Try to compile for speed
try:
    model = torch.compile(model)
    print("Compiled model for faster training.")
except Exception as e:
    print("torch.compile failed:", e)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Final AIML Project - Sarcasm Detection/HypertunedRoBerta/training",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",
    save_strategy="epoch",
    remove_unused_columns=False,
    fp16=True,
    optim="adamw_torch_fused",
    learning_rate=best["learning_rate"],
    warmup_ratio=best["warmup_ratio"],
    weight_decay=best["weight_decay"],
    num_train_epochs=3,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_persistent_workers=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("/content/drive/MyDrive/Final AIML Project - Sarcasm Detection/HypertunedRoBerta/saved")

In [None]:
preds_output = trainer.predict(dataset["test"])

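raw = preds_output.predictions              # may be a (loss, logits) tuple or a bare logits array
logits = raw[1] if isinstance(raw, tuple) else raw   # shape: (num_samples, 2)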
preds = np.argmax(logits, axis=1)           # predicted classes: 0 or 1
labels = dataset["test"]["labels"]          # Get true labels directly from the dataset

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Accuracy:", accuracy_score(labels, preds))
print("F1:", f1_score(labels, preds))
print("Confusion matrix:\n", confusion_matrix(labels, preds))
print(classification_report(labels, preds))

In [None]:
def predict_sarcasm(model, text, tokenizer=tokenizer, device=None):
    model.eval()
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    if isinstance(text, str):
        texts = [text]
    else:
        texts = text

    # Tokenize
    encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=128, return_tensors="pt")

    # Lexical features
    lexical_features = torch.tensor([lexical_features_one(t) for t in texts], dtype=torch.float32)

    # Move to device
    input_ids = encodings["input_ids"].to(device)
    attention_mask = encodings["attention_mask"].to(device)
    lexical_features = lexical_features.to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        lexical_features=lexical_features)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1).cpu().numpy()
        preds = np.argmax(probs, axis=1)

    # Return single prediction if input was a string
    if isinstance(text, str):
        return {"predicted_label": int(preds[0]), "probabilities": probs[0]}
    else:
        return [{"predicted_label": int(p), "probabilities": prob} for p, prob in zip(preds, probs)]

statement = "Thank you for taking your sweet time on this!"

result = predict_sarcasm(model, statement)
print("Predicted label:", result["predicted_label"])  # 0 = not sarcastic, 1 = sarcastic
print("Probabilities:", result["probabilities"])

In [None]:
statements = [
    # Obvious sarcasm
    "Oh great, another Monday morning… just what I needed!",
    "I absolutely love it when my phone dies in the middle of an important call.",
    "Sure, I’d love to do more paperwork instead of going home early.",
    "Yeah right, because traffic is exactly what I was hoping for today.",
    "I totally enjoy waking up to the sound of my neighbor’s dog at 5 AM.",

    # Mild sarcasm
    "Wow, that movie was really… interesting.",
    "I’m so glad it’s raining again, just perfect for a picnic.",
    "Oh sure, because staying late at work is my favorite hobby.",

    # Neutral / serious
    "I had a sandwich for lunch.",
    "The sky is blue today.",
    "I am going to the grocery store after work.",
    "She won the award for best performance."
]

statement_results = []
results = predict_sarcasm(model, statements)
for r, s in zip(results, statements):
    statement_results.append({'Statement' : s, 'Is_Sarcastic' : r['predicted_label'] == 1})
display(pd.DataFrame(statement_results))

In [None]:
statements_tfidf_vec = vectorizer.transform(statements)
statements_lex_features = build_lexical_matrix(statements)

statements_combined_features = hstack([statements_tfidf_vec, statements_lex_features])

tfidf_predictions = clf.predict(statements_combined_features)

# Create a DataFrame to display the results
tfidf_results = []
for s, pred in zip(statements, tfidf_predictions):
    tfidf_results.append({'Statement': s, 'Is_Sarcastic': bool(pred)})

display(pd.DataFrame(tfidf_results))

RoBERTa seems to get it, while TF-IDF gets it less so, even though the prediction metrics on the test data are not too different.