# Funnybot's Evaluator Models

## Overview

This notebook builds and evaluates models used by Funnybot to evaluate/rate jokes.

I'm using the [GPT2 transformers](https://huggingface.co/gpt2) as well as other utilities from [Hugging Face](https://huggingface.co) to train and evaluate the models in this project.

Due to my current hardware constraints and my reluctance to spend hard-earned money on cloud GPUs (in other words, I'm a cheapskate), this model does not work great. However, I hope this notebook may be helpful to someone struggling to find a straightforward and complete tutorial on text classification on the web.

The models from Hugging Face are pre-trained: they have already been optimized to have a general understanding of natural language. In this notebook we are going to fine-tune one of them to classify a particular kind of text, i.e., jokes, into rating classes.

## Dependencies

In [1]:
from datasets import Dataset
import datetime

import evaluate
from evaluate import evaluator
from evaluate import load

import pandas as pd

from pathlib import Path

import profanity_check

import string

import torch
from torch import nn
from torch.utils.data import DataLoader

from tqdm.auto import tqdm

from transformers import AutoTokenizer, DataCollatorWithPadding, GPT2Config, GPT2Tokenizer, GPT2ForSequenceClassification
from torch.optim import AdamW
from transformers import get_scheduler
from transformers import TextClassificationPipeline

This notebook requires the native dependencies from the "Comedy Club" project. You may install them by running:

```
cd <"Comedy Club" project root>
pip install .
```

Additionally, the dependencies defined in [requirements-dev.txt](../requirements-dev.txt) are required. You may install them by running:

```
pip install -r requirements-dev.txt
```

## Raw Data

We are using the [Jester Dataset](https://www.kaggle.com/datasets/vikashrajluhaniwal/jester-17m-jokes-ratings-dataset). It contains few distinct jokes (140 once the ratings are merged in, as shown further below), which is largely to blame for the poor performance of the resulting model.

The dataset is actually split into two files: one has the jokes themselves and the other has the ratings. The two are related by the `jokeId` column.

In [3]:
data_dir = Path().absolute().parent.parent / "data"

joke_ratings_items_df = pd.read_csv(data_dir / "joke-ratings-items.csv")

joke_ratings_items_df

Unnamed: 0,jokeId,jokeText
0,1,"A man visits the doctor. The doctor says ""I ha..."
1,2,This couple had an excellent relationship goin...
2,3,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,4,Q. What's the difference between a man and a t...
4,5,Q.\tWhat's O. J. Simpson's Internet address? \...
...,...,...
145,146,America: 8:00 - Welcome to work! 12:00 - Lunch...
146,147,It was the day of the big sale. Rumors of the ...
147,148,"Recently a teacher, a garbage collector, and a..."
148,149,"A little girl asked her father, ""Daddy? Do all..."


In [4]:
joke_ratings_values_df = pd.read_csv(data_dir / "joke-ratings-values.csv")

joke_ratings_values_df

Unnamed: 0,userId,jokeId,rating
0,1,5,0.219
1,1,7,-9.281
2,1,8,-9.281
3,1,13,-6.781
4,1,15,0.875
...,...,...,...
1761434,63978,57,-8.531
1761435,63978,24,-9.062
1761436,63978,124,-9.031
1761437,63978,58,-8.656


In [5]:
pd.set_option('max_colwidth', None)

joke_ratings_df = pd.merge(joke_ratings_items_df, joke_ratings_values_df, on="jokeId")

joke_ratings_df = joke_ratings_df.groupby(["jokeId", "jokeText"])["rating"].mean().reset_index()
joke_ratings_df = joke_ratings_df.rename(columns={"jokeText": "text", "rating": "labels"})
joke_ratings_df = joke_ratings_df.drop(columns=["jokeId"])

joke_ratings_df[:3]

Unnamed: 0,text,labels
0,"Q.\tWhat's O. J. Simpson's Internet address? \nA.\tSlash, slash, backslash, slash, slash, escape.\n",-1.756331
1,How many feminists does it take to screw in a light bulb?\nThat's not funny.\n,-1.80923
2,Q. Did you hear about the dyslexic devil worshiper? \n\nA. He sold his soul to Santa.\n,-0.67201
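
In case the merge and group-by steps are unfamiliar, here is a minimal plain-Python sketch of what they compute: every rating is attached to its joke through `jokeId`, and the ratings of each joke are then averaged. The two jokes and four ratings below are made up for illustration and are not taken from the Jester files.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical miniature versions of the two CSV files.
items = {1: "Joke one", 2: "Joke two"}  # jokeId -> jokeText
ratings = [                             # (userId, jokeId, rating)
    (1, 1, 4.0),
    (2, 1, -2.0),
    (1, 2, 3.0),
    (3, 2, 5.0),
]

# Attach every rating to the joke it belongs to (the "merge" step).
ratings_by_joke = defaultdict(list)
for _user_id, joke_id, rating in ratings:
    if joke_id in items:
        ratings_by_joke[joke_id].append(rating)

# Average the ratings of each joke (the "groupby(...).mean()" step).
rows = [
    {"text": items[joke_id], "labels": mean(values)}
    for joke_id, values in sorted(ratings_by_joke.items())
]

print(rows)
```

Each joke ends up as a single row whose label is its mean rating, which is the shape of `joke_ratings_df` above.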


We are going to normalize our jokes mostly for the sake of tidiness; this doesn't seem to have a huge effect on the model's performance.

The following normalization will be applied to all jokes:

- Remove non ASCII characters (with the assumption that jokes are always in English).
- Remove some non-standard punctuation characters.
- Remove excessive spacing.

In [5]:
def normalize_sentence(row):
    characters_to_remove = string.punctuation.replace(".", "").replace("-", "").replace("'", "")
    
    text = row["text"].encode('ascii', errors='ignore').decode()
    text = " ".join(text.split()).strip()
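    # Note: str.split treats its argument as one separator string, not as a set of
    # characters, so this line leaves the punctuation in the output below untouched.
    text = " ".join(text.split(characters_to_remove)).strip()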

    return text

joke_ratings_df["text"] = joke_ratings_df.apply(normalize_sentence, axis=1)

joke_ratings_df[:3]

Unnamed: 0,text,labels
0,"Q. What's O. J. Simpson's Internet address? A. Slash, slash, backslash, slash, slash, escape.",-1.756331
1,How many feminists does it take to screw in a light bulb? That's not funny.,-1.80923
2,Q. Did you hear about the dyslexic devil worshiper? A. He sold his soul to Santa.,-0.67201
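
Punctuation such as `?` and `,` is still present in the output above. If dropping it is what you want, one possible approach is a translation table built with `str.maketrans`. This is only a sketch: `normalize_text` and `removal_table` are names I made up for the example, and the model in this notebook was trained on the output shown above, not on text cleaned this way.

```python
import string

# Keep the same three characters the cell above keeps: ".", "-" and "'".
characters_to_remove = (
    string.punctuation.replace(".", "").replace("-", "").replace("'", "")
)

# Map every unwanted punctuation character to a space.
removal_table = str.maketrans(characters_to_remove, " " * len(characters_to_remove))


def normalize_text(text):
    # Drop non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode()
    # Replace unwanted punctuation with spaces.
    text = text.translate(removal_table)
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.
    return " ".join(text.split())


print(normalize_text("Q.\tWhat's   up?  \nA. Not  much!\n"))
```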


As you may have noticed, the ratings are actually floating-point numbers, whereas our application needs ratings as integers in the interval [1, 10]. For this reason, we are going to normalize these ratings.

In [6]:
max_rating = joke_ratings_df["labels"].max()
min_rating = joke_ratings_df["labels"].min()

(min_rating, max_rating)

(-2.7495735330223425, 3.7143809194009174)

Here we normalize the ratings. The classification model works with class IDs in the interval [0, N - 1], where N is the number of labels, because its loss function treats each label as a class index.

Thus, given that we need 10 labels, our interval will be [0, 9].

In [7]:
max_rating = joke_ratings_df["labels"].max()
min_rating = joke_ratings_df["labels"].min()

max_target_rating = 9
min_target_rating = 0


def normalize_rating(row):
    return round((row["labels"] - min_rating) / (max_rating - min_rating) * (max_target_rating - min_target_rating) + min_target_rating)

joke_ratings_df["labels"] = joke_ratings_df.apply(normalize_rating, axis=1)

joke_ratings_df[:3]

Unnamed: 0,text,labels
0,"Q. What's O. J. Simpson's Internet address? A. Slash, slash, backslash, slash, slash, escape.",1
1,How many feminists does it take to screw in a light bulb? That's not funny.,1
2,Q. Did you hear about the dyslexic devil worshiper? A. He sold his soul to Santa.,3
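
To make the arithmetic concrete, here is the same min-max scaling applied to the three mean ratings printed earlier, using the minimum and maximum shown by cell 6. The function name `to_label_id` is only used for this illustration.

```python
# Minimum and maximum mean ratings, copied from the output of cell 6.
min_rating = -2.7495735330223425
max_rating = 3.7143809194009174


def to_label_id(mean_rating, low=0, high=9):
    # Scale the rating to [0, 1], stretch it to [low, high], then round.
    scaled = (mean_rating - min_rating) / (max_rating - min_rating)
    return round(scaled * (high - low) + low)


for mean_rating in (-1.756331, -1.80923, -0.67201):
    print(mean_rating, "->", to_label_id(mean_rating))
```

The three jokes map to the IDs 1, 1 and 3, matching the table above, and the lowest and highest mean ratings map to 0 and 9 respectively.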


Here we check our labels to see if they were normalized correctly:

In [8]:
joke_ratings_df.describe()

Unnamed: 0,labels
count,140.0
mean,6.1
std,2.022535
min,0.0
25%,5.0
50%,6.5
75%,8.0
max,9.0


In [9]:
sorted(joke_ratings_df["labels"].unique())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

We want our model to output labels in the interval [1, 10], so we need to create an ID-to-label mapping. This mapping will be used in the model's configuration:

In [10]:
id2label = {label_id: label_id + 1 for label_id in range(max_target_rating - min_target_rating + 1)}

id2label

{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}

## Creating Datasets

In [11]:
joke_ratings_df = joke_ratings_df.reset_index(drop=True)

full_dataset = Dataset.from_pandas(joke_ratings_df)

raw_datasets = full_dataset.train_test_split(test_size=0.3)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 98
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 42
    })
})

Next, we wrap each joke in the training split with the start-of-text and end-of-text tokens (the test split is left as it is):

In [12]:
def wrap_text(example):
    example["text"] = "<|startoftext|>" + example["text"] + "<|endoftext|>"
    return example

raw_datasets["train"] = raw_datasets["train"].map(wrap_text)

raw_datasets["train"]["text"][:2]

Map:   0%|          | 0/98 [00:00<?, ? examples/s]

["<|startoftext|>A radio conversation of a US naval ship with Canadian authorities ... Americans: Please divert your course 15 degrees to the North to avoid a collision. Canadians: Recommend you divert YOUR course 15 degrees to the South to avoid a collision. Americans: This is the Captain of a US Navy ship. I say again, divert YOUR course. Canadians: No. I say again, you divert YOUR course. Americans: This is the aircraft carrier USS LINCOLN, the second largest ship in the United States' Atlantic Fleet. We are accompanied by three destroyers, three cruisers and numerous support vessels. I demand that you change your course 15 degrees north, that's ONE FIVE DEGREES NORTH, or counter-measures will be undertaken to ensure the safety of this ship. Canadians: This is a lighthouse. Your call.<|endoftext|>",
 '<|startoftext|>A man arrives at the gates of heaven. St. Peter asks, "Religion?" The man says, "Methodist." St. Peter looks down his list, and says, "Go to room 24, but be very quiet a

## Creating a Tokenizer

As for the checkpoint, we are going to use [gpt2](https://huggingface.co/gpt2), which is suitable for text generation and small enough for the purposes of this challenge (development).

Hugging Face makes available the following GPT2 checkpoints for transformers:

- gpt2 (137M parameters)
- gpt2-medium (380M parameters)
- gpt2-large (821M parameters)
- gpt2-xl (1.5B parameters)

Even gpt2-medium was challenging for my local computer, thus, we will stick to "gpt2" for the purposes of this project.

In [13]:
checkpoint = "gpt2"

The tokenizer we are going to use is the following (suitable for our classifier):

In [14]:
tokenizer = GPT2Tokenizer.from_pretrained(
    checkpoint, 
    bos_token="<|startoftext|>", 
    eos_token="<|endoftext|>",
    padding=True,
    pad_token="<|pad|>", 
    truncation=True
)

inputs = tokenizer("<|startoftext|>This is my sentence.<|endoftext|>")
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


['<|startoftext|>', 'This', 'Ġis', 'Ġmy', 'Ġsentence', '.', '<|endoftext|>']

>***Note:***
>
>*The warning is just fine. We did add new tokens to the vocabulary (beginning/ending of sentence, and padding tokens) and we are going to fine-tune the word embeddings when we train the model.*

In [15]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

tokenized_datasets["train"]

Map:   0%|          | 0/98 [00:00<?, ? examples/s]

Map:   0%|          | 0/42 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 98
})

In [16]:
tokenized_datasets["test"]

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 42
})

The original text column cannot be used for training, so we remove it. We also change the format to "torch":

In [17]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets.set_format("torch")

tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 98
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 42
    })
})

## Creating a Data Loader

The data loader allows us to feed our dataset to the model in batches during training (no `batch_size` is set below, so each batch holds a single example). First we need to create a data collator:

In [18]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

data_collator

DataCollatorWithPadding(tokenizer=GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<|pad|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=True), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

In [19]:
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator
)

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([1]),
 'input_ids': torch.Size([1, 111]),
 'attention_mask': torch.Size([1, 111])}
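
With a single example per batch there is nothing to pad. With larger batches, the collator's job is roughly the following: pad every sequence on the right up to the longest one in the batch and mark the real tokens in an attention mask. This is a plain-Python sketch of the idea, not the library's implementation, and `pad_batch` and `pad_id=0` are made up for the example.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token ID lists to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(padded["input_ids"])
print(padded["attention_mask"])
```

If you raise `batch_size`, note that, as far as I know, `GPT2ForSequenceClassification` also needs `model.config.pad_token_id` to be set (for example to `tokenizer.pad_token_id`) so that it can find the last real token of each padded sequence.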

## Training our Model

Given that our purpose is text classification, [GPT2ForSequenceClassification](https://huggingface.co/docs/transformers/v4.15.0/model_doc/gpt2#transformers.GPT2ForSequenceClassification) is suitable for the job.

We load the checkpoint's configuration, override the number of labels and the ID-to-label mapping, and then load the pre-trained model with that configuration. Because we added special tokens to the tokenizer, the model's token embeddings also have to be resized to the new vocabulary size.

In [20]:
configuration = GPT2Config.from_pretrained(
    checkpoint,
    output_hidden_states=False,
    num_labels=len(id2label),
    id2label=id2label
)

model = GPT2ForSequenceClassification.from_pretrained(checkpoint, config=configuration)
model.resize_token_embeddings(len(tokenizer))

outputs = model(**batch)
outputs[:2]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(tensor(14.5046, grad_fn=<NllLossBackward0>),
 tensor([[  1.4496, -10.4196,   0.4926,   4.2343,   2.1749, -10.0697,   5.5400,
           -4.4224,  -7.0931,  10.0679]], grad_fn=<IndexBackward0>))

>***Note:***
>
>*The warning is just fine. Training the model is exactly what we intend to do.*
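
To see how the two values in the output relate, here is a small worked example that recomputes a cross-entropy loss from the printed logits. The true label of that batch is not shown in the output; I'm assuming it was ID 7 because that value reproduces the printed loss, so treat it as an inference rather than a fact.

```python
import math

# Logits copied from the output above (rounded to four decimals there).
logits = [1.4496, -10.4196, 0.4926, 4.2343, 2.1749,
          -10.0697, 5.5400, -4.4224, -7.0931, 10.0679]
id2label = {label_id: label_id + 1 for label_id in range(10)}


def cross_entropy(logits, target_id):
    # log(sum(exp(logits))) computed in a numerically stable way.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_id]


# The prediction is the ID with the largest logit.
predicted_id = max(range(len(logits)), key=logits.__getitem__)
print("predicted label:", id2label[predicted_id])

assumed_target_id = 7  # assumption, see the paragraph above
loss = cross_entropy(logits, assumed_target_id)
print("loss:", loss)
```

The untrained head is confidently wrong here: it predicts label 10 while putting almost no probability on the assumed true class, which is why the loss is so large.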

Here are the parameters we will be using for our training:

In [21]:
num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)
num_warmup_steps = int(num_training_steps * 0.1)

(num_training_steps, num_warmup_steps)

(490, 49)

To train our model using [PyTorch](https://pytorch.org/) we will require the following components:

- Optimizer (in case you are not familiar with ML, this is the component that updates the model's weights from the gradients; AdamW is an adaptive variant of stochastic gradient descent).
- Scheduler (a component that adjusts the learning rate as training progresses).
- Tokenizer (created in previous sections).
- Data Loader (created in previous sections).

We are going to use AdamW as our optimizer:

In [22]:
optimizer = AdamW(model.parameters(), lr=5e-5)

The device used depends on our own hardware:

In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

device(type='cpu')

Here's our scheduler:

In [24]:
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
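
As I understand the "linear" schedule, the learning rate ramps up from zero during the warmup steps and then decays linearly back to zero by the last training step. The sketch below reproduces that shape with the numbers from this notebook (49 warmup steps, 490 training steps, base learning rate 5e-5); `linear_schedule_factor` is a name made up for the example, not a library function.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
for step in (0, 25, 49, 270, 490):
    print(step, base_lr * linear_schedule_factor(step, 49, 490))
```

The peak learning rate is reached at step 49, right when warmup ends.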

Finally, we train our model using our dataset:

In [25]:
progress_bar = tqdm(range(num_training_steps))

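model.to(device)  # keep the model and the batches on the same device
model.train()
for _ in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)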
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/490 [00:00<?, ?it/s]

## Testing our Model

In [26]:
pipe = TextClassificationPipeline(
    model=model.to(device),
    tokenizer=tokenizer
)

jokes = [
    "My dog has no nose. How does it smell? Awful!",
    
    "I can predict the motion of the heavenly bodies, but not the madness of people.",

    """Acid rain. Drug addictions. International terrorism. Freeway killers.
       Now, more than ever, is important to remember the true meaning of Christmas.
       Don't miss Charles Dickens' immortal classic. "Scrooge".
       Your life might just depend on it.
    """,

    "I always like walking in the rain, so no one can see me crying.",
]
    
evaluator_results = [{"joke": joke, "label": pipe(joke)} for joke in jokes]

pd.DataFrame(evaluator_results)

Unnamed: 0,joke,label
0,My dog has no nose. How does it smell? Awful!,"[{'label': 10, 'score': 0.5223402380943298}]"
1,"I can predict the motion of the heavenly bodies, but not the madness of people.","[{'label': 7, 'score': 0.45346590876579285}]"
2,"Acid rain. Drug addictions. International terrorism. Freeway killers.\n Now, more than ever, is important to remember the true meaning of Christmas.\n Don't miss Charles Dickens' immortal classic. ""Scrooge"".\n Your life might just depend on it.\n","[{'label': 8, 'score': 0.2282678782939911}]"
3,"I always like walking in the rain, so no one can see me crying.","[{'label': 7, 'score': 0.45665329694747925}]"


## Evaluating our Model

We need a way to evaluate the performance of our models, so we will compute a metric from the `evaluate` library. This requires an evaluation dataset:

In [27]:
eval_dataloader = DataLoader(
    tokenized_datasets["test"], shuffle=True, collate_fn=data_collator
)

for batch in eval_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([1]),
 'input_ids': torch.Size([1, 149]),
 'attention_mask': torch.Size([1, 149])}

Here we generate the metrics:

In [28]:
metric = evaluate.load("accuracy")

model.eval()

with torch.no_grad():  # gradients are not needed for evaluation
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model.to(device)(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.19047619047619047}
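
Accuracy is simply the fraction of predictions that match the reference labels exactly. With 42 test examples, the value above corresponds to 8 correct predictions. Here is a minimal sketch of the computation; the two short lists are hypothetical and are not the model's actual predictions.

```python
def accuracy(predictions, references):
    """Fraction of predictions that are exactly equal to the reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical label IDs, for illustration only.
predictions = [9, 6, 6, 7, 5]
references = [9, 5, 6, 8, 4]
print(accuracy(predictions, references))

# 8 correct predictions out of 42 test examples.
print(8 / 42)
```

Note that this metric gives no credit for near misses: predicting 7 when the reference is 8 counts the same as predicting 0.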

Here are some results for different training parameters:

| Epochs | Accuracy            |
| ------ | ------------------- |
| 3      | 0.23809523809523808 |
| 4      | 0.09523809523809523 |
| 5      | 0.16666666666666666 |
| 6      | 0.11904761904761904 |


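These are not great results: with ten classes, accuracies between roughly 0.10 and 0.24 are not far from guessing. A larger GPT2 model and, above all, more data would likely improve them.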

## Saving our Models
### Local Environment

We need to save our models to the directory the application expects. The evaluator's model and tokenizer will be saved under the `joke-evaluator` directory, in the `model` and `tokenizer` subdirectories respectively.

In [29]:
save_dir = Path().absolute().parent / "joke-evaluator"

model.save_pretrained(save_dir / "model")
tokenizer.save_pretrained(save_dir / "tokenizer")

('/home/marcio/workspace/konfuzio-ai/ai-comedy-club/bots/funnybot/transformers/joke-evaluator/tokenizer/tokenizer_config.json',
 '/home/marcio/workspace/konfuzio-ai/ai-comedy-club/bots/funnybot/transformers/joke-evaluator/tokenizer/special_tokens_map.json',
 '/home/marcio/workspace/konfuzio-ai/ai-comedy-club/bots/funnybot/transformers/joke-evaluator/tokenizer/vocab.json',
 '/home/marcio/workspace/konfuzio-ai/ai-comedy-club/bots/funnybot/transformers/joke-evaluator/tokenizer/merges.txt',
 '/home/marcio/workspace/konfuzio-ai/ai-comedy-club/bots/funnybot/transformers/joke-evaluator/tokenizer/added_tokens.json')

### Hugging Face's Hub

You will need a Hugging Face account, and of course you will only be able to push to repositories in your own account.

In [33]:
%%script false --no-raise-error

model.push_to_hub("marciogualtieri/funnybot-joke-evaluator-model")
tokenizer.push_to_hub("marciogualtieri/funnybot-joke-evaluator-tokenizer")

pytorch_model.bin:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/marciogualtieri/funnybot-joke-evaluator-tokenizer/commit/de871bc28abe811c2de46cf777d51bcfe5b374ea', commit_message='Upload tokenizer', commit_description='', oid='de871bc28abe811c2de46cf777d51bcfe5b374ea', pr_url=None, pr_revision=None, pr_num=None)