### Introduction

Training a model to rate jokes accurately is difficult for several reasons. Humour is highly subjective and varies greatly from person to person. It is also context-dependent, and jokes often rely on wordplay, sarcasm, and cultural references, all of which are hard for an AI system to understand and replicate.

This section of the project fine-tunes the `distilbert-base-uncased` model on a custom dataset to predict joke ratings. `distilbert-base-uncased` is a smaller, distilled version of BERT that has been pre-trained on a large corpus of English text, which gives it a strong general grasp of language and context. Fine-tuning it for joke rating lets us reuse that language comprehension to predict a numeric rating for each joke.

The generated ratings can provide a helpful reference but should not be taken as an absolute measure of a joke's comedic worth.

### Install and Import Dependencies

In [None]:
!pip install -U accelerate datasets transformers

In [None]:
import nltk
nltk.download("vader_lexicon")

In [None]:
import os
import random

import numpy as np
import pandas as pd
import torch
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    EarlyStoppingCallback,
    IntervalStrategy,
    Trainer,
    TrainingArguments
)

from datasets import Dataset

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

### Load Data

In [None]:
df = pd.read_csv("../datasets/clean_jokes_2.csv")

### Rate Jokes

We use the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyser to generate a sentiment score for each joke, then map its compound score from the range [-1, 1] to a 1-10 rating. Adding 1 shifts the range to [0, 2], and multiplying by 5 scales it to [0, 10]. The result is rounded to the nearest integer, and any rating of 0 is raised to 1. These scaled sentiment scores serve as the target ratings for this dataset.

In [None]:
analyser = SentimentIntensityAnalyzer()

In [None]:
def rate_joke(joke):
    """Rates a joke on a scale of 1-10 based on its sentiment."""
    scores = analyser.polarity_scores(joke)
    rating = round((scores["compound"] + 1) * 5)
    rating = max(rating, 1)
    return rating
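
To make the mapping concrete, the sketch below applies the same arithmetic as `rate_joke` to a few hand-picked compound scores, without calling VADER. The helper name `compound_to_rating` and the sample scores are illustrative assumptions, not values from the dataset.

```python
def compound_to_rating(compound):
    """Map a VADER-style compound score in [-1, 1] to an integer rating in [1, 10]."""
    shifted = compound + 1          # [-1, 1] -> [0, 2]
    scaled = shifted * 5            # [0, 2]  -> [0, 10]
    return max(round(scaled), 1)    # round, then raise 0 to the minimum rating of 1


# Hand-picked compound scores (hypothetical, for illustration only).
for compound in (-1.0, -0.6, 0.0, 0.64, 1.0):
    print(compound, "->", compound_to_rating(compound))
```

Python's built-in `round` rounds exact halves to the nearest even integer, so a scaled score of exactly 2.5 becomes 2 rather than 3.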

In [None]:
tqdm.pandas(desc="Rating jokes")

In [None]:
df["ratings"] = df["jokes"].progress_apply(rate_joke)
df["ratings"] = df["ratings"].astype(float)

### Train Test Split

In [None]:
train_df, test_df = train_test_split(df, test_size=0.1, random_state=1)
train_df, val_df = train_test_split(train_df, test_size=len(test_df), random_state=1)

In [None]:
train_df.shape, test_df.shape, val_df.shape
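
The two `train_test_split` calls above yield roughly an 80/10/10 split, because the second call passes an integer row count rather than a fraction. The sketch below reproduces only the size arithmetic. The helper `split_sizes` and the row count of 1,000 are hypothetical, and the sketch assumes scikit-learn's behaviour of rounding a fractional `test_size` up to a whole number of rows.

```python
import math


def split_sizes(n_rows, test_fraction=0.1):
    """Return (train, val, test) row counts for the two-stage split."""
    # First split: a float test_size is turned into a row count (rounded up).
    n_test = math.ceil(test_fraction * n_rows)
    n_remaining = n_rows - n_test
    # Second split: test_size=len(test_df) is an int, so it is used as an
    # absolute number of validation rows.
    n_val = n_test
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


print(split_sizes(1000))  # (800, 100, 100)
```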

### Load Model and Tokeniser

In [None]:
checkpoint = "distilbert-base-uncased"
tokeniser = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint, num_labels=1)

### Preprocess Data

In [None]:
def tokenise_data(tokeniser, df):
    """Tokenise data."""
    encodings = tokeniser(list(df["jokes"]), truncation=True, padding=True)
    return encodings


train_encodings = tokenise_data(tokeniser, train_df)
val_encodings = tokenise_data(tokeniser, val_df)
test_encodings = tokenise_data(tokeniser, test_df)

In [None]:
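def create_dataset(encodings, ratings):
    """Build a TensorDataset of (input_ids, attention_mask, labels)."""
    input_ids = torch.tensor(encodings["input_ids"], dtype=torch.long)
    # The attention mask holds 0/1 flags, so an integer dtype is sufficient.
    attention_mask = torch.tensor(encodings["attention_mask"], dtype=torch.long)
    # Float labels are required for the regression (MSE) objective.
    labels = torch.tensor(ratings.values, dtype=torch.float)

    dataset = torch.utils.data.TensorDataset(input_ids, attention_mask, labels)
    return dataset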


train_dataset = create_dataset(train_encodings, train_df["ratings"])
val_dataset = create_dataset(val_encodings, val_df["ratings"])
test_dataset = create_dataset(test_encodings, test_df["ratings"])

### Evaluation Metrics

In [None]:
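def compute_metrics(eval_pred):
    """Compute regression metrics (MSE, RMSE, MAE)."""
    predictions, labels = eval_pred.predictions, eval_pred.label_ids

    # Predictions have shape (N, 1) and labels (N,); flatten both so the
    # subtraction is element-wise rather than broadcast to (N, N).
    predictions = predictions.reshape(-1)
    labels = labels.reshape(-1)

    mse = np.mean((predictions - labels) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(predictions - labels))

    return {"mse": mse, "rmse": rmse, "mae": mae}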
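
One shape detail matters in a metric function like this. A regression head with a single output typically returns predictions of shape `(N, 1)`, while the labels have shape `(N,)`. Subtracting the two in NumPy broadcasts to an `(N, N)` matrix instead of `N` element-wise differences. The sketch below uses made-up numbers to show the effect and how flattening the predictions avoids it.

```python
import numpy as np

# Made-up values: three predictions as a column vector, three labels as a flat vector.
predictions = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
labels = np.array([1.0, 2.0, 4.0])             # shape (3,)

# Broadcasting pairs every prediction with every label.
pairwise = predictions - labels
print(pairwise.shape)  # (3, 3)
mse_broadcast = np.mean(pairwise ** 2)

# Flattening first gives one difference per example.
elementwise = predictions.reshape(-1) - labels
print(elementwise.shape)  # (3,)
mse_flat = np.mean(elementwise ** 2)

print(mse_broadcast, mse_flat)
```

When the training loss is mean squared error, a large gap between the logged loss and a hand-computed `mse` can be a sign of this kind of shape mismatch.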

### Set up the Training Arguments and Data Collator

In [None]:
model_path = "./ChuckleChiefRater"

training_args = TrainingArguments(
    output_dir=model_path,
    num_train_epochs=4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir=model_path,
    prediction_loss_only=False,
    evaluation_strategy=IntervalStrategy.STEPS,
    eval_steps=500,
    save_steps=2000,
    save_total_limit=2,
    load_best_model_at_end=True
)

def data_collator(features):
    """Stack (input_ids, attention_mask, labels) tuples into a batch dict."""
    return {
        "input_ids": torch.stack([f[0] for f in features]),
        "attention_mask": torch.stack([f[1] for f in features]),
        "labels": torch.stack([f[2] for f in features])
    }

### Train the Model

In [None]:
torch.cuda.empty_cache()

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    compute_metrics=compute_metrics
)

trainer.train()



| Step | Training Loss | Validation Loss | Mse | Rmse | Mae |
|------|---------------|-----------------|-----|------|-----|
| 500 | 4.6879 | 2.334007 | 6.614991 | 2.571963 | 2.069519 |
| 1000 | 2.0871 | 1.648909 | 6.806559 | 2.608938 | 2.0627 |
| 1500 | 1.4056 | 1.420838 | 7.791146 | 2.791262 | 2.205038 |
| 2000 | 1.1679 | 1.378124 | 8.595682 | 2.931839 | 2.321136 |
| 2500 | 0.9458 | 1.154988 | 8.322764 | 2.88492 | 2.288404 |
| 3000 | 0.6809 | 1.027863 | 8.051106 | 2.837447 | 2.238993 |
| 3500 | 0.6708 | 0.977394 | 7.8079 | 2.794262 | 2.202772 |
| 4000 | 0.497 | 0.965505 | 8.102925 | 2.846564 | 2.251322 |
| 4500 | 0.4573 | 0.956578 | 8.11199 | 2.848155 | 2.248373 |


TrainOutput(global_step=4784, training_loss=1.3447293501633863, metrics={'train_runtime': 1901.8502, 'train_samples_per_second': 20.111, 'train_steps_per_second': 2.515, 'total_flos': 5066522707943424.0, 'train_loss': 1.3447293501633863, 'epoch': 4.0})
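
To illustrate how a patience of 2 interacts with the logged validation losses, here is a simplified re-implementation of patience-based early stopping. It is a sketch, not the `EarlyStoppingCallback` source: the function `first_stop_index` is hypothetical, and it assumes that lower loss is better and that any evaluation failing to beat the best loss so far counts against the patience.

```python
def first_stop_index(losses, patience=2):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(losses):
        if loss < best:
            # New best loss: reset the counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return index
    return None


# Validation losses copied from the training log above (steps 500 to 4500).
validation_losses = [
    2.334007, 1.648909, 1.420838, 1.378124, 1.154988,
    1.027863, 0.977394, 0.965505, 0.956578,
]
print(first_stop_index(validation_losses))  # None: the loss improves at every evaluation

# A made-up sequence where two evaluations in a row fail to improve.
print(first_stop_index([1.0, 0.9, 0.95, 0.92, 0.8]))  # 3
```

Under these assumptions the stopping condition never triggers on the logged losses, which is consistent with the run completing all 4 epochs.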

### Evaluate the Model


In [21]:
trainer.evaluate(test_dataset)

{'eval_loss': 1.0274949073791504,
 'eval_mse': 8.364996910095215,
 'eval_rmse': 2.89223051071167,
 'eval_mae': 2.2968695163726807,
 'eval_runtime': 17.8778,
 'eval_samples_per_second': 66.899,
 'eval_steps_per_second': 8.39,
 'epoch': 4.0}

### Save the Model


In [None]:
new_model_path = os.path.join(model_path, "model")
if not os.path.exists(new_model_path):
    os.makedirs(new_model_path)

trainer.save_model(new_model_path)
tokeniser.save_pretrained(new_model_path)

### Rate Jokes

In [None]:
loaded_model = DistilBertForSequenceClassification.from_pretrained(new_model_path)
loaded_tokeniser = DistilBertTokenizer.from_pretrained(new_model_path)

In [None]:
def predict_rating(model, tokeniser, joke):
    """Predict a rating for a single joke with the fine-tuned model."""
    # Tokenise the joke and return PyTorch tensors.
    encoding = tokeniser(joke, truncation=True, return_tensors="pt")

    # Move the inputs to the same device as the model.
    input_ids = encoding["input_ids"].to(model.device)
    attention_mask = encoding["attention_mask"].to(model.device)

    model.eval()
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    # The regression head outputs a single logit, which is the predicted rating.
    return outputs.logits.item()

In [None]:
jokes = [
    "Why don't scientists trust atoms? Because they make up everything!",
    "Why don't skeletons fight each other? They don't have the guts!",
    "What do you call a fish with no eyes? Fsh!",
    "Why did the scarecrow win an award? Because he was outstanding in his field!",
    "Why don't eggs tell jokes? Because they might crack up!",
    "What do you call fake spaghetti? An impasta!",
    "Why did the computer go to the doctor? Because it had a virus!",
    "What do you call a bear with no teeth? A gummy bear!",
    "Why did the math book look sad? Because it had too many problems!"
]

In [26]:
ratings = [predict_rating(loaded_model, loaded_tokeniser, joke) for joke in jokes]

for joke, rating in zip(jokes, ratings):
    print(joke, rating)

Why don't scientists trust atoms? Because they make up everything! 5.096904277801514
Why don't skeletons fight each other? They don't have the guts! 4.697202682495117
What do you call a fish with no eyes? Fsh! 3.0918586254119873
Why did the scarecrow win an award? Because he was outstanding in his field! 9.265880584716797
Why don't eggs tell jokes? Because they might crack up! 5.2596869468688965
What do you call fake spaghetti? An impasta! 2.6296956539154053
Why did the computer go to the doctor? Because it had a virus! 5.1471662521362305
What do you call a bear with no teeth? A gummy bear! 3.020991325378418
Why did the math book look sad? Because it had too many problems! 1.6071512699127197
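
The regression head returns an unbounded float, so predictions are not guaranteed to be whole numbers or to stay within the 1-10 range of the training targets. If a whole-number rating is wanted, one option is to round and clamp the prediction. The helper `to_rating_scale` below is a hypothetical post-processing step, not part of the original pipeline.

```python
def to_rating_scale(prediction, low=1, high=10):
    """Round a raw regression output and clamp it to the [low, high] rating scale."""
    return int(min(max(round(prediction), low), high))


# The first two values are predictions printed above; the last two are
# made-up out-of-range inputs.
for raw in (9.265880584716797, 1.6071512699127197, 11.3, -0.4):
    print(raw, "->", to_rating_scale(raw))
```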
