# Joke Generator

<font size=3>The model was fine-tuned on Google Colab, starting from the GPT-2 Medium weights. gpt2-medium has a total of <i>354,823,168</i> parameters. I also explored the exploration vs. exploitation trade-off: instead of always choosing the most probable token, I sample a token from the top n candidates, so the results vary between runs.</font>

In [None]:
!pip install transformers -q


In [None]:
import tensorflow as tf

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

import os
import numpy as np
import pandas as pd

from tqdm.notebook import tqdm

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Data

<font size=3>I used the <b>shortjokes</b> dataset from Kaggle, which is already preprocessed so the text is clean.</font>

In [None]:
jokes = pd.read_csv('/content/gdrive/MyDrive/shortjokes.csv')
print(jokes.shape)
jokes.head()

(231657, 2)


| | ID | Joke |
|---|---|---|
| 0 | 1 | [me narrating a documentary about narrators] "... |
| 1 | 2 | Telling my daughter garlic is good for you. Go... |
| 2 | 3 | I've been going through a really rough period ... |
| 3 | 4 | If I could have dinner with anyone, dead or al... |
| 4 | 5 | Two guys walk into a bar. The third guy ducks. |


In [None]:
jokeslist = jokes['Joke'].to_list()

In [None]:
Tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')

<font size=3>We add a padding token and define the start and end markers. The markers let the model learn where a joke begins and where it ends.</font>

In [None]:
special_tokens_dict = {'pad_token': 'pad'}
num_added_toks = Tokenizer.add_special_tokens(special_tokens_dict)

START_TOKEN = '<|start|> '
END_TOKEN = ' <|end|>'

### Transforming Dataset

<font size=3>We convert the jokes to a tf.data.Dataset with a fixed batch size. The start and end markers are added to every joke, and each joke is then truncated or padded to max_length tokens.</font>

In [None]:
def convert_jokes_to_dataset(jokes, tokenizer, shuffle=True, batch_size=16, max_length=64):
    """
    Convert a list of jokes into a TensorFlow dataset with batches of encoded jokes.

    Parameters:
    jokes (list): A list of jokes.
    tokenizer (Tokenizer): A tokenizer to encode the jokes.
    shuffle (bool): Whether to shuffle the dataset. Default is True.
    batch_size (int): The number of samples per batch. Default is 16.
    max_length (int): The maximum length for each encoded joke. Default is 64.

    Returns:
    tf.data.Dataset: A TensorFlow dataset with batches of encoded jokes.
    """

    # Add start and end tokens to each joke
    jokes = [f"{START_TOKEN}{joke}{END_TOKEN}" for joke in jokes]
    
    encodings = [
        tokenizer.encode_plus(
            joke,
            truncation=True,
            add_special_tokens=True,
            max_length=max_length,
            padding="max_length"
        ) for joke in jokes
    ]
    
    input_ids = [encoding["input_ids"] for encoding in encodings]
    attention_masks = [encoding["attention_mask"] for encoding in encodings]
    
    inputs = {
        "input_ids": input_ids,
        "attention_mask": attention_masks
    }

    # Converting the inputs into a tf dataset
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(jokes))
    dataset = dataset.batch(batch_size)

    return dataset
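
<font size=3>To make the padding and attention mask concrete, here is a minimal sketch in plain Python that mimics what the tokenizer does with `truncation=True` and `padding="max_length"`. The helper name `pad_or_truncate`, the token IDs, and `pad_id=0` are made up for illustration, and padding on the right is assumed.</font>

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncate long sequences
    n_pad = max_length - len(kept)                   # number of pad slots needed
    input_ids = kept + [pad_id] * n_pad              # right-pad short sequences
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token IDs, chosen only for illustration.
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short_ids, short_mask)  # [11, 22, 33, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

<font size=3>Every row ends up with the same length, which is what lets `from_tensor_slices` stack the jokes into one tensor.</font>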

In [None]:
jokes_dataset = convert_jokes_to_dataset(jokeslist, Tokenizer)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


### Model

<font size=3>The model we use is <b>TFGPT2LMHeadModel</b>, initialized from the gpt2-medium weights with about 354M trainable parameters. This is a large model, over 1 GB in size (the checkpoint download is 1.52 GB). The <b>from_logits=True</b> parameter tells the loss function that the model outputs raw logits rather than probabilities, so the loss applies the softmax itself when computing the cross-entropy over the vocabulary. At inference time we apply the same softmax to the logits to find the most probable next tokens. We use the Adam optimizer to update the model.</font>

In [None]:
model = TFGPT2LMHeadModel.from_pretrained('gpt2-medium')
model.summary()

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 354823168 
 r)                                                              
                                                                 
Total params: 354,823,168
Trainable params: 354,823,168
Non-trainable params: 0
_________________________________________________________________


In [None]:
loss_function = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
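
<font size=3>As a rough illustration of what `from_logits=True` implies, the sketch below computes the cross-entropy for a single position by hand: softmax over the logits, then the negative log of the probability given to the true token. It uses plain Python instead of TensorFlow, and the three-entry "vocabulary" and the function names are assumptions for the example.</font>

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one position: -log(softmax(logits)[label])."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 1.0, 0.1]   # toy scores over a 3-token vocabulary
probs = softmax(logits)
loss = sparse_ce_from_logits(logits, label=0)
uniform_loss = sparse_ce_from_logits([0.0, 0.0, 0.0], label=1)

print(probs)
print(loss, uniform_loss)
```

<font size=3>With equal logits the loss equals log(3), the cost of guessing uniformly among three tokens; a confident correct prediction pushes the loss toward zero.</font>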

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, epsilon=1e-08, clipnorm=1.0)

### Checkpoints

<font size=3>Because I was working on Colab, where a session can be interrupted, I wanted the model to store checkpoints during training and to reload from the latest checkpoint after any failure.</font>

In [None]:
CHECKPOINT_PATH = "/content/gdrive/My Drive/Weights/JokeGenGPT2"

In [None]:
checkpoint_path = CHECKPOINT_PATH

ckpt = tf.train.Checkpoint(model = model)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print ('Latest checkpoint restored!!')

Latest checkpoint restored!!


### Training

<font size=3>The following function performs a forward pass, computes the loss, calculates the gradients, and updates the model's parameters.</font>

In [None]:
@tf.function
def perform_training_step(input_data):
    """
    Perform a training step for the model.

    Parameters:
    input_data (dict): A dictionary of input data.

    Returns:
    tf.Tensor: The loss value for the current training step.
    """
    
    with tf.GradientTape() as tape:
        model_outputs = model(input_data)
        logits = model_outputs[0]
        labels = input_data['input_ids']
        shifted_logits = logits[..., :-1, :]
        shifted_labels = labels[..., 1:]
        loss = loss_function(tf.reshape(shifted_labels, (-1,)),
                             tf.reshape(shifted_logits, (-1, shifted_logits.shape[-1])))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return loss
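
<font size=3>The shift in the training step pairs the prediction made at position i with the token at position i+1. A minimal sketch with readable strings in place of token IDs (the helper name and the toy tokens are made up for illustration):</font>

```python
def shift_for_next_token(tokens):
    """Pair each input position with the token the model should predict next."""
    inputs = tokens[:-1]    # positions whose logits are kept (logits[..., :-1, :])
    targets = tokens[1:]    # labels moved one step left (labels[..., 1:])
    return list(zip(inputs, targets))


pairs = shift_for_next_token(["<start>", "Two", "guys", "<end>"])
for seen, expected in pairs:
    print(f"after {seen!r} predict {expected!r}")
```

<font size=3>The last position has no target and the first token is never a target, so a sequence of length n yields n-1 training pairs. Note that the training step above does not apply the attention mask to the loss, so padded positions also contribute to it.</font>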

<font size=3>We then run the training steps and save a checkpoint at the end of every epoch. I ran only 3 epochs because training took a long time.</font>

In [None]:
EPOCHS = 3
for epoch in range(EPOCHS):
    total_steps = 10000
    for batch, data in tqdm(enumerate(jokes_dataset.take(total_steps)), total=total_steps):
        loss = perform_training_step(data)
        if batch % 200 == 0:
            print('Epoch : {0} Batch : {1} ---- Loss : {2}'.format(epoch+1, batch+1, loss))
    ckpt_save_path = ckpt_manager.save()
    print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,ckpt_save_path))
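
<font size=3>Since each epoch is capped at `total_steps = 10000` batches, one epoch here is not a full pass over the data. A quick back-of-the-envelope check, using the dataset size printed earlier and the default batch size of 16:</font>

```python
import math

num_jokes = 231_657     # rows in shortjokes.csv, from jokes.shape
batch_size = 16         # default in convert_jokes_to_dataset
total_steps = 10_000    # cap used in the training loop

batches_per_full_pass = math.ceil(num_jokes / batch_size)
jokes_seen_per_epoch = min(total_steps, batches_per_full_pass) * batch_size
coverage = jokes_seen_per_epoch / num_jokes

print(batches_per_full_pass)   # 14479
print(jokes_seen_per_epoch)    # 160000
print(round(coverage, 3))      # 0.691
```

<font size=3>So each epoch covers roughly 69% of the jokes. Assuming the dataset reshuffles on every iteration (the tf.data default), each epoch draws a different subset.</font>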

### Inference

<font size=3>Instead of always taking the single most probable token, we consider the top n most probable next tokens and sample one of them at random, weighted by their renormalized probabilities. This adds variety to the model's generations.</font>

In [None]:
def get_optimal_token_with_exploration(token_probabilities, num_explore=5):
    """
    Get the optimal token with exploration. This will be a random choice from 
    the top most probable tokens.

    Parameters:
    token_probabilities (numpy array): Array of token probabilities.
    num_explore (int): Number of tokens to explore.

    Returns:
    optimal_token_id (int): ID of the optimal token.
    """
    top_token_indices = np.argpartition(token_probabilities, -num_explore)[-num_explore:]
    normalized_probabilities = token_probabilities[top_token_indices]
    normalized_probabilities /= np.sum(normalized_probabilities)
    random_choice = np.random.choice(num_explore, 1, p=normalized_probabilities)
    optimal_token_id = int(top_token_indices[random_choice][0])
    return optimal_token_id
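
<font size=3>To see the effect of this kind of top-n sampling, here is a small standalone sketch on a toy 6-token distribution. It uses the standard library's `random.choices` rather than NumPy so it runs on its own; the function name, the toy probabilities, and the seed are assumptions for illustration.</font>

```python
import random


def sample_from_top_n(probabilities, n, rng):
    """Sample one index from the n most probable entries, weighted by probability."""
    # Indices of the n largest probabilities.
    top = sorted(range(len(probabilities)),
                 key=lambda i: probabilities[i], reverse=True)[:n]
    weights = [probabilities[i] for i in top]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)                            # fixed seed for repeatability
toy_probs = [0.02, 0.40, 0.03, 0.30, 0.05, 0.20]  # toy next-token distribution
draws = [sample_from_top_n(toy_probs, 3, rng) for _ in range(1000)]
counts = {i: draws.count(i) for i in sorted(set(draws))}
print(counts)
```

<font size=3>Only indices 1, 3 and 5 (the three most probable tokens) are ever drawn, and the more probable ones are drawn more often. Low-probability tokens are never chosen, which keeps the text coherent while still allowing variation.</font>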

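<font size=3>To generate text, we start from the encoded Start marker and shrink the number of explored tokens as the joke grows: 50 candidates for the first token, 15 for the next three, and 4 afterwards. We keep appending tokens until one that belongs to the encoded END marker appears. If none appears within joke_length tokens, the function returns an empty string.</font>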

In [None]:
def create_joke(joke_length = 64):
    current_joke = tf.expand_dims(tf.convert_to_tensor(Tokenizer.encode(START_TOKEN)), 0)

    for pos in range(joke_length):
        output = model(current_joke)
        logits = output[0]
        softmax_logits = tf.nn.softmax(logits[0, -1], axis=0).numpy()
        if pos == 0:
            exploration_len = 50
        elif pos < 4:
            exploration_len = 15
        else:
            exploration_len = 4
        token_to_append = get_optimal_token_with_exploration(softmax_logits, exploration_len)
        current_joke = tf.concat([current_joke,tf.ones((1,1), dtype = tf.int32)*token_to_append],axis = 1)
        if token_to_append in Tokenizer.encode(END_TOKEN):
            return Tokenizer.decode(list(tf.squeeze(current_joke).numpy()))
    return ''
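
<font size=3>The exploration schedule used inside `create_joke` can be pulled out into a small helper to make it easier to read and tweak. This is only a sketch; the helper name `exploration_length` is not part of the notebook.</font>

```python
def exploration_length(pos):
    """Number of top tokens to sample from at generation step `pos`."""
    if pos == 0:
        return 50   # wide choice for the opening token, so jokes start differently
    if pos < 4:
        return 15   # still fairly open for the next three tokens
    return 4        # narrow afterwards to keep the sentence coherent


schedule = [exploration_length(p) for p in range(6)]
print(schedule)  # [50, 15, 15, 15, 4, 4]
```

<font size=3>A wide choice at the start gives each joke a different opening, while the narrow choice later keeps the continuation close to what the model considers likely.</font>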

In [None]:
create_joke()

"<|start|>  Me: What did you think of the new iPhone? Wife: It's not the same as the original, but it is still pretty cool. <"