### Project Description

Humour, even for us humans, can be mysterious. It is no wonder that it poses a unique challenge for AI systems as well. Let us think about it - we all have that one friend who effortlessly brings laughter into our lives. They have a natural knack for timing, delivery, and a deep understanding of what makes something funny. But have we ever tried to explain why they are funny?

In this project, we aim to build an AI bot that generates new jokes. The task is difficult because humour is subjective and context-dependent, which makes it hard for an AI system to understand and replicate.

Additionally, jokes often rely on wordplay, sarcasm, and cultural references, which further complicates the task of generating original and funny jokes. Despite these challenges, we are determined to push the boundaries of AI and humour, striving to create a bot that can bring joy and laughter to users worldwide.

Meet **ChuckleChief**, our enthusiastic and curious novice AI companion, eager to unravel the mysteries of humour.

### Install and Import Dependencies

In [None]:
!pip install accelerate -U better_profanity datasets transformers

In [None]:
import nltk
nltk.download(["stopwords", "wordnet"])

In [None]:
import json
import os
import random
import re

import better_profanity as bp
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import torch
from tqdm import tqdm
from transformers import (
    AutoConfig,
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

from datasets import Dataset

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

### Load the Data

| Dataset Name | Description | Source | Number of Jokes | Format |
| --- | --- | --- | --- | --- |
| Short Jokes | A collection of short jokes in English | Kaggle | 231,657 | CSV |
| Joke Dataset - Stupidstuff | A dataset of English plaintext jokes from stupidstuff.org | GitHub | 3,770 | JSON |
| Joke Dataset - Wocka | A dataset of English plaintext jokes from wocka.com | GitHub | 10,000 | JSON |

The Short Jokes dataset from Kaggle is a large collection of short jokes in English that includes both one-liners and longer jokes. The dataset contains 231,657 jokes in CSV format.

The Joke Dataset from taivop/joke-dataset on GitHub includes two separate datasets: Stupidstuff and Wocka. The Stupidstuff dataset contains 3,770 English plaintext jokes scraped from stupidstuff.org, while the Wocka dataset contains 10,000 English plaintext jokes scraped from wocka.com. Both datasets are in JSON format and contain additional fields such as category, title, and rating.

Therefore, the total number of jokes in all three datasets combined is 245,427 jokes.

In [None]:
def load_joke_data():
    """Load joke data"""
    short_jokes_df = pd.read_csv("short_jokes.csv")

    with open("stupidstuff.json", "r") as file:
        stupidstuff = json.load(file)
    stupidstuff_df = pd.DataFrame(stupidstuff, columns=["id", "body", "category"])

    with open("wocka.json", "r") as file:
        wocka = json.load(file)
    wocka_df = pd.DataFrame(wocka, columns=["id", "title", "body", "category"])

    return short_jokes_df, stupidstuff_df, wocka_df


short_jokes_df, stupidstuff_df, wocka_df = load_joke_data()

In [None]:
short_jokes_df.head(2)

In [None]:
stupidstuff_df.head(2)

In [None]:
wocka_df.head(2)

### Preprocess Data

- Remove punctuation.
- Remove non-alphabetical characters.
- Remove numbers.
- Convert to lowercase.
- Strip leading/trailing whitespace.
- Remove newlines and carriage returns.
- Check if the joke is clean.

In [None]:
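def preprocess_joke(joke):
    """Clean a raw joke: keep only lowercase letters and single-line whitespace"""
    joke = re.sub(r"[^\w\s]", "", joke)      # Remove punctuation.
    joke = re.sub(r"[^a-zA-Z\s]", "", joke)  # Remove non-alphabetical characters.
    joke = re.sub(r"\d", "", joke)           # Remove digits (raw string avoids an invalid escape).
    joke = joke.lower()
    joke = joke.strip()
    joke = joke.replace("\n", " ").replace("\r", "")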

    return joke
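
To see what these steps do to a concrete string, here is a minimal, self-contained sketch that repeats the same regular expressions on a made-up joke. The sample text and the `demo_preprocess` name are illustrative assumptions, not part of the datasets or the notebook.

```python
import re


def demo_preprocess(joke):
    """Illustrative copy of the cleaning steps above (hypothetical helper name)."""
    joke = re.sub(r"[^\w\s]", "", joke)      # drop punctuation, including apostrophes
    joke = re.sub(r"[^a-zA-Z\s]", "", joke)  # drop digits, underscores and non-ASCII letters
    joke = joke.lower().strip()
    return joke.replace("\n", " ").replace("\r", "")


sample = "Why don't 3 chickens cross the road?\nThey're busy!"
cleaned = demo_preprocess(sample)
print(cleaned)
```

Contractions lose their apostrophes (`don't` becomes `dont`) and removing a digit leaves a double space behind, so the fine-tuned model sees, and later reproduces, text in this form.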

In [None]:
short_jokes_df["Joke"] = short_jokes_df["Joke"].apply(preprocess_joke)
stupidstuff_df["body"] = stupidstuff_df["body"].apply(preprocess_joke)
wocka_df["body"] = wocka_df["body"].apply(preprocess_joke)

In [None]:
short_jokes_df.shape, stupidstuff_df.shape, wocka_df.shape

In [None]:
all_jokes = (
    short_jokes_df["Joke"].tolist()
    + stupidstuff_df["body"].tolist()
    + wocka_df["body"].tolist()
)

In [None]:
print(len(all_jokes))

In [None]:
def sample_jokes(all_jokes, n):
    """Sample n jokes without replacement"""
    # Draw distinct indices so the same joke cannot be picked twice.
    joke_indexes = np.random.choice(len(all_jokes), n, replace=False)
    sampled_jokes = [all_jokes[index] for index in joke_indexes]
    return sampled_jokes

sampled_15000_jokes = sample_jokes(all_jokes, 15000)
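
Whether a sample is drawn with or without replacement matters here, because a joke drawn twice can later land in both the training and the validation split. A minimal standard-library sketch of the difference, using a made-up list in place of `all_jokes`:

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable
population = [f"joke {i}" for i in range(20)]  # stand-in for `all_jokes`

with_replacement = random.choices(population, k=15)    # items may repeat
without_replacement = random.sample(population, k=15)  # every item is distinct

# Compare how many distinct jokes each strategy produced.
print(len(set(with_replacement)), len(set(without_replacement)))
```

Sampling without replacement always yields 15 distinct items here, while sampling with replacement can yield fewer.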

In [None]:
def is_clean(joke):
    """Check if a joke is clean"""
    return not bp.profanity.contains_profanity(joke)

In [None]:
def filter_jokes(jokes):
    """Filter out offensive jokes"""
    clean_jokes = []
    for joke in tqdm(jokes):
        if is_clean(joke):
            clean_jokes.append(joke)
    return clean_jokes

clean_jokes = filter_jokes(sampled_15000_jokes)

In [None]:
len(clean_jokes)

In [None]:
clean_jokes_df = pd.DataFrame(clean_jokes, columns=["jokes"])
clean_jokes_df.to_csv("clean_jokes_new.csv", index=False)

In [None]:
clean_jokes_df = pd.read_csv("clean_jokes_new.csv")
clean_jokes_df.dropna(inplace=True)

In [None]:
def get_subset(jokes_df, n):
    """Get data subset"""
    clean_jokes = [
        str(joke).strip() for joke in jokes_df["jokes"] if len(str(joke).strip()) >= 10
    ]
    random_jokes = random.sample(clean_jokes, n)
    subset_df = pd.DataFrame(random_jokes, columns=["jokes"])
    return subset_df

In [None]:
def split_data(jokes_df, train_size=0.8, val_size=0.1, test_size=0.1):
    """Split data into train, validation, and test sets.

    The test set takes every row not assigned to train or validation,
    so `test_size` is informational only.
    """
    total_size = len(jokes_df)

    # Calculate the number of examples for the train and validation sets.
    train_num = int(train_size * total_size)
    val_num = int(val_size * total_size)

    all_indices = np.arange(total_size)
    train_indices = np.random.choice(all_indices, train_num, replace=False)
    val_indices = np.random.choice(
        np.setdiff1d(all_indices, train_indices), val_num, replace=False
    )
    test_indices = np.setdiff1d(
        all_indices, np.concatenate([train_indices, val_indices])
    )

    # Split data based on indices.
    train_df = jokes_df.iloc[train_indices]
    val_df = jokes_df.iloc[val_indices]
    test_df = jokes_df.iloc[test_indices]

    return train_df, val_df, test_df
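
An equivalent way to obtain three disjoint splits is to shuffle the row indices once and slice them. The following is a sketch of that alternative, not the notebook's `split_data`; `split_indices` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random


def split_indices(total_size, train_size=0.8, val_size=0.1, seed=0):
    """Shuffle row indices once and cut them into three disjoint groups."""
    indices = list(range(total_size))
    random.Random(seed).shuffle(indices)
    train_num = int(train_size * total_size)
    val_num = int(val_size * total_size)
    train = indices[:train_num]
    val = indices[train_num:train_num + val_num]
    test = indices[train_num + val_num:]  # whatever remains
    return train, val, test


train_idx, val_idx, test_idx = split_indices(11000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index lists could be passed to `DataFrame.iloc` in the same way as the index arrays in `split_data`.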

### Model Comparison

- `bert-base-uncased`: BERT (Bidirectional Encoder Representations from Transformers) is a widely-used transformer-based model that has achieved state-of-the-art performance on various natural language processing tasks. The "base" variant refers to its medium-sized configuration, offering a balance between model size and performance. "uncased" indicates that the model treats all text as lowercase, disregarding capitalization. BERT incorporates a deep bidirectional transformer encoder, capturing contextual information from both preceding and following words. It is pretrained on a large corpus and can be fine-tuned for specific tasks.

- `distilbert-base-uncased`: DistilBERT is a distilled version of BERT, striking a good balance between performance and efficiency. It retains competitive performance while being smaller and faster than the original BERT model. Like BERT, "uncased" signifies that the model operates with lowercase text. DistilBERT achieves efficiency gains through techniques such as knowledge distillation and parameter reduction. Its reduced size makes it more manageable and quicker to fine-tune, particularly in scenarios with limited computational resources or smaller training datasets.  

- `gpt2`: GPT-2 (Generative Pre-trained Transformer 2) is a decoder-only language model trained to predict the next token, which makes it a natural fit for text generation tasks such as joke generation. Unlike BERT and DistilBERT, which are encoder-only models aimed at language understanding, GPT-2 generates text autoregressively, and its self-attention layers capture long-range dependencies in the input text.

We will start by fine-tuning the smallest `gpt2` checkpoint (124M parameters) for joke generation.

### Load Model and Tokeniser

In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokeniser = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
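# Count only parameters that will receive gradient updates.
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"The model has {num_trainable:,} trainable parameters")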

In [None]:
# Access the transformer layers.
transformer_layers = model.transformer.h

# Print the information about each layer.
for i, layer in enumerate(transformer_layers):
    print(f"Layer {i}: {layer}")

In [None]:
# # Set the number of layers to train.
# num_layers_to_train = 4

# # Freeze all layers.
# for layer in model.transformer.h:
#     for param in layer.parameters():
#         param.requires_grad = False

# # Enable gradient computation for the top layers.
# for i, layer in enumerate(model.transformer.h[-num_layers_to_train:]):
#     for param in layer.parameters():
#         param.requires_grad = True

### Add Custom Tokens

In [None]:
bos = "<|endoftext|>"  # Beginning of sequence token (already in GPT-2's vocabulary)
eos = "<|eos|>"        # End of sequence token (new)
pad = "<|pad|>"        # Padding token (new)

special_tokens = {"bos_token": bos, "eos_token": eos, "pad_token": pad}

# Add custom tokens to the tokeniser; the call returns how many tokens were added.
num_added_tokens = tokeniser.add_special_tokens(special_tokens)

# Model config with custom tokens.
config = AutoConfig.from_pretrained(
    "gpt2",
    bos_token_id=tokeniser.bos_token_id,
    eos_token_id=tokeniser.eos_token_id,
    pad_token_id=tokeniser.pad_token_id,
    output_hidden_states=False
)

# Load model with config.
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)

# Resize embeddings to include new tokens.
model.resize_token_embeddings(len(tokeniser))
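
Special tokens only influence generation if they also appear in the training text. The notebook tokenises the jokes as they are; one possible extension, shown here purely as an assumption and not as part of the original pipeline, is to wrap each joke in the BOS and EOS strings before tokenisation so that the model can learn where a joke ends. The `wrap_joke` helper is hypothetical.

```python
BOS = "<|endoftext|>"  # same strings as the special tokens defined above
EOS = "<|eos|>"


def wrap_joke(joke, bos=BOS, eos=EOS):
    """Surround a joke with boundary tokens (hypothetical helper)."""
    return f"{bos}{joke.strip()}{eos}"


wrapped = wrap_joke("  why did the chicken cross the road  ")
print(wrapped)
```

If this route were taken, generation could then stop at the EOS token instead of always running to `max_length`.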

### Define a Helper Function to Generate Jokes

In [None]:
def generate_jokes(model, tokeniser, prompt, num_jokes=3, max_len=30):
    """Sample continuations of `prompt` until `num_jokes` clean jokes are collected"""
    filtered_jokes = []

    # Encode the prompt once; it does not change between sampling rounds.
    input_ids = tokeniser.encode(prompt, return_tensors="pt")

    while len(filtered_jokes) < num_jokes:
        generated_text_samples = model.generate(
            input_ids,
            max_length=max_len,
            num_return_sequences=num_jokes,
            repetition_penalty=1.5,
            temperature=0.75,
            do_sample=True
        )

        generated_jokes = [
            tokeniser.decode(joke, skip_special_tokens=True)
            for joke in generated_text_samples
        ]

        # Keep only inoffensive jokes, and never more than requested.
        for joke in generated_jokes:
            if is_clean(joke) and len(filtered_jokes) < num_jokes:
                filtered_jokes.append(joke)

    return filtered_jokes
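
The `temperature=0.75` argument controls how adventurous sampling is: the model's raw scores are divided by the temperature before being turned into probabilities, so values below 1 concentrate probability on the most likely tokens. The sketch below illustrates the idea in plain Python with made-up scores; it is not the library's implementation, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
default_probs = softmax_with_temperature(toy_logits, temperature=1.0)
cooler_probs = softmax_with_temperature(toy_logits, temperature=0.75)
print(default_probs)
print(cooler_probs)
```

With the lower temperature, the top-scoring token receives a larger share of the probability mass, which trades some variety for coherence.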

In [None]:
clean_jokes_subset = get_subset(clean_jokes_df, 11000)
train_df, val_df, test_df = split_data(clean_jokes_subset)
train_df.shape, val_df.shape, test_df.shape

In [None]:
train_jokes = Dataset.from_pandas(train_df[["jokes"]])
val_jokes = Dataset.from_pandas(val_df[["jokes"]])

### Tokenise and Pad Data

In [None]:
tokenised_train_jokes = train_jokes.map(
    lambda x: tokeniser(x["jokes"], padding=True),
    batched=True
)

tokenised_val_jokes = val_jokes.map(
    lambda x: tokeniser(x["jokes"], padding=True),
    batched=True
)
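
With `padding=True`, each batch passed to the tokeniser is padded to the length of its longest sequence, and an attention mask marks which positions hold real tokens. A plain-Python sketch of that idea, using made-up token ids and a hypothetical `pad_batch` helper (this is not the tokeniser's own code):

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


toy_batch = [[11, 12, 13], [21], [31, 32]]  # made-up token ids
ids, mask = pad_batch(toy_batch, pad_id=0)
print(ids)
print(mask)
```

Because padding is applied per batch, sequences in different batches can end up with different lengths.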

### Set up the Training Arguments and Data Collator

In [None]:
model_path = "./ChuckleChief"

training_args = TrainingArguments(
    output_dir=model_path,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-5,
    logging_dir=model_path,
    prediction_loss_only=True,
    evaluation_strategy="epoch",
    save_strategy="epoch"
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokeniser, mlm=False)
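
These settings determine how many optimisation steps the run takes. The back-of-the-envelope calculation below assumes a single device and no gradient accumulation, and uses the split sizes that follow from the 11,000-joke subset with an 80/10/10 split.

```python
import math

train_examples = 8800  # 80% of the 11,000-joke subset
val_examples = 1100    # 10% of the subset
batch_size = 4         # per_device_train_batch_size and per_device_eval_batch_size
epochs = 2             # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = steps_per_epoch * epochs
eval_steps = math.ceil(val_examples / batch_size)
print(steps_per_epoch, total_train_steps, eval_steps)
```

Under these assumptions the run takes 2,200 steps per epoch, 4,400 in total, with 275 evaluation steps per pass over the validation set.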

### Train the Model

In [None]:
torch.cuda.empty_cache()

In [25]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenised_train_jokes,
    eval_dataset=tokenised_val_jokes,
)
trainer.train()

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 4.6519 | 4.46903 |
| 2 | 4.4296 | 4.416205 |

### Save the Model

In [None]:
trainer.save_model()
tokeniser.save_pretrained(model_path)

('./ChuckleChief/tokenizer_config.json',
 './ChuckleChief/special_tokens_map.json',
 './ChuckleChief/vocab.json',
 './ChuckleChief/merges.txt',
 './ChuckleChief/added_tokens.json')

### Evaluate the Model

In [26]:
trainer.evaluate()

{'eval_loss': 4.416205406188965,
 'eval_runtime': 76.1981,
 'eval_samples_per_second': 14.436,
 'eval_steps_per_second': 3.609,
 'epoch': 2.0}
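
A loss value is easier to interpret as perplexity, the exponential of the mean per-token cross-entropy. Assuming `eval_loss` is that mean measured in nats, a minimal conversion looks like this:

```python
import math

eval_loss = 4.416205406188965  # `eval_loss` reported by `trainer.evaluate()` above
perplexity = math.exp(eval_loss)
print(f"Validation perplexity: {perplexity:.1f}")
```

This gives a perplexity of roughly 83, meaning the model is on average about as uncertain as if it were choosing uniformly among some 83 tokens at each step.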

### Generate Jokes

In [None]:
loaded_model = GPT2LMHeadModel.from_pretrained(model_path)
loaded_tokeniser = GPT2Tokenizer.from_pretrained(model_path)

In [41]:
prompt = "Here is a joke filled with harmless humour:"
generate_jokes(loaded_model, loaded_tokeniser, prompt)

['Here is a joke filled with harmless humour: youre going to spend days out in your apartment laughing at the news and then getting up early on sund',
 'Here is a joke filled with harmless humour: everyone laughs at the moon because that means it takes one to walk up on an emaciated alien you',
 'Here is a joke filled with harmless humour: youre not going to be offended by it unless your name goes on top of the list and shows up']

### Acknowledgement: Datasets

- [Short Jokes](https://www.kaggle.com/datasets/abhinavmoudgil95/short-jokes)
- [Stupidstuff](https://github.com/taivop/joke-dataset#stupidstuffjson)
- [Wocka](https://github.com/taivop/joke-dataset#wockajson)