# Fine-tuning GPT-2 on a jokes dataset in PyTorch

This notebook was created as a part of a blog post - [Fine-tuning large Transformer models on a single GPU in PyTorch - Teaching GPT-2 a sense of humor](https://mf1024.github.io/2019/11/12/Fun-With-GPT-2/). Here I demonstrate how to fine-tune a pre-trained GPT-2 model on a jokes dataset.

Let's see if the model can learn to crack some jokes!

For this experiment, I will use a pre-trained GPT-2 medium-sized model from the huggingface [transformers repository](https://github.com/huggingface/transformers).

#### If you haven't yet, check out the notebook in this [gist](https://gist.github.com/mf1024/430d7fd6ff527350d3e4b5bda0d8614e) where I use the same pretrained model to generate text.

In [9]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorWithPadding
import numpy as np

import logging
logging.getLogger().setLevel(logging.CRITICAL)

import warnings
warnings.filterwarnings('ignore')

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

RuntimeError: Failed to import transformers.models.gpt2.modeling_gpt2 because of the following error (look up to see its traceback):
partially initialized module 'torchvision' has no attribute 'extension' (most likely due to a circular import)

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
# add a padding token so that our text sequences can be aligned to the same length;
# without it, fine-tuning fails
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
# resize the model's token embeddings to match the tokenizer vocabulary after adding the padding token
model.resize_token_embeddings(len(tokenizer))
model = model.to(device)

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [3]:
def choose_from_top(probs, n=5):
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob) # Normalize
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return int(token_id)

### PyTorch Dataset module for Short jokes dataset

For fine-tuning the GPT2 model, I will use this [Short Jokes dataset](https://www.kaggle.com/abhinavmoudgil95/short-jokes) published on Kaggle. After each joke, I add "<|endoftext|>", which the GPT2 model recognizes as an end-of-text marker. The marker allows me to concatenate many jokes in a single input sequence.
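To illustrate how an end-of-text marker lets several short texts share one input sequence, here is a minimal sketch. The `pack_texts` helper, the sample strings, and the character-based length limit are assumptions made for illustration only; a real pipeline would measure length in tokens rather than characters.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text marker


def pack_texts(texts, max_chars=80):
    """Greedily concatenate texts, each followed by EOS, into chunks of at most max_chars.

    Hypothetical helper: length is measured in characters to keep the sketch
    dependency-free; a single text longer than max_chars still gets its own chunk.
    """
    chunks, current = [], ""
    for text in texts:
        piece = text + EOS
        # Start a new chunk when the next piece would overflow the current one
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


jokes = ["JOKE: first", "JOKE: second", "JOKE: third"]
packed = pack_texts(jokes, max_chars=50)
for chunk in packed:
    print(repr(chunk))
```

Because every text ends with the marker, the boundaries between texts stay recoverable even after they are packed into one sequence.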

In [4]:
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
import os
import json
import csv

dataset = load_dataset("allenai/sciq")
train_dataset = dataset['train']

In [5]:
train_dataset

Dataset({
    features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
    num_rows: 11679
})

In [6]:
# since the dataset comes from the HF hub, it is already well preprocessed
# all that remains is to convert it into a format the model understands and, as in the original example, append end-of-text markers

def preprocess_function(dataset_row):
    question = dataset_row["question"]
    correct_answer = dataset_row["correct_answer"]
    distractors = f"DISTRACTORS: {dataset_row['distractor1']} | {dataset_row['distractor2']} | {dataset_row['distractor3']}"
    return f"QUESTION: {question}\nANSWER: {correct_answer}\n{distractors}\n<|endoftext|>"

# use map to process the dataset row by row, so as not to overload Colab
train_dataset = train_dataset.map(lambda x: {"text": preprocess_function(x)})

In [7]:
train_dataset[:2]

{'question': ['What type of organism is commonly used in preparation of foods such as cheese and yogurt?',
  'What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?'],
 'distractor3': ['viruses', 'tropical effect'],
 'distractor1': ['protozoa', 'muon effect'],
 'distractor2': ['gymnosperms', 'centrifugal effect'],
 'correct_answer': ['mesophilic organisms', 'coriolis effect'],
 'support': ['Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.',
  'Without Coriolis Effect the global winds would blow north to south or south to north. Bu

### Hyperparameters

I tested many (more than 5) hyperparameter sets until I found the one that worked best. I mostly tuned ***BATCH_SIZE*** (in this case, it's the number of forward-backward passes between each optimization step), ***EPOCHS***, and ***LEARNING_RATE***.
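The accumulation pattern described above (several forward-backward passes per optimization step) can be sketched with a toy model. Everything here is an assumption for illustration: the `Linear` stand-in model, the random data, the `ACCUMULATION_STEPS` name, and the choice to divide each loss by the number of accumulated passes so that the summed gradient is an average.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Linear(4, 1)  # toy stand-in for the language model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
ACCUMULATION_STEPS = 4  # hypothetical name: passes between optimizer steps

inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

optimizer_steps = 0
optimizer.zero_grad()
for i in range(len(inputs)):
    # One forward-backward pass on a single example
    pred = model(inputs[i:i + 1])
    loss = torch.nn.functional.mse_loss(pred, targets[i:i + 1])
    # Gradients add up in .grad across passes; scaling keeps them an average
    (loss / ACCUMULATION_STEPS).backward()

    # Update the weights only after ACCUMULATION_STEPS passes
    if (i + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(f"optimizer steps taken: {optimizer_steps}")
```

This trades wall-clock time for memory: only one small batch has to fit on the GPU at a time, while the update still reflects several batches.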

As a starting point for the fine-tuning parameter values, I took inspiration from [this](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py) and [this](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py) huggingface fine-tuning code.
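The notebook imports `get_linear_schedule_with_warmup` alongside these hyperparameters. As a rough sketch of the learning-rate multiplier that a linear warmup followed by a linear decay applies, consider the plain-Python function below. The name `linear_warmup_decay` and the step counts are hypothetical and chosen only to make the shape easy to see.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


LEARNING_RATE = 3e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, warmup_steps=100, total_steps=1000)
    print(f"step {step:4d}: lr = {LEARNING_RATE * factor:.2e}")
```

One consequence worth keeping in mind: if the total number of steps passed to the schedule is smaller than the number of optimizer steps actually taken, the multiplier reaches zero early and the remaining updates have no effect.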

In [8]:
BATCH_SIZE = 16
EPOCHS = 2
LEARNING_RATE = 3e-5
WARMUP_STEPS = 1000
MAX_SEQ_LEN = 400
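from torch.optim import AdamW  # the AdamW implementation in transformers is deprecated in favor of the PyTorch one
from transformers import get_linear_schedule_with_warmup as WarmupLinearSchedule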

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

In [9]:
# Tokenization that explicitly returns input_ids and attention_mask
def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding="max_length",  # Pad every sequence to the same fixed length
        return_tensors="pt"    # Return PyTorch tensors
    )

# Apply the tokenization
encoded_dataset = train_dataset.map(tokenize_function, batched=True)

# Remove the extra fields, keeping only input_ids and attention_mask
encoded_dataset = encoded_dataset.remove_columns(["text"])  # Drop the raw text once it is no longer needed
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(encoded_dataset, batch_size=BATCH_SIZE, shuffle=True)


In [10]:
# # tokenize in advance so that we don't have to do it manually when trial-generating distractors
# encoded_dataset = train_dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=MAX_SEQ_LEN, padding=True), batched=True)
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
# dataloader = DataLoader(encoded_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=data_collator)


### Model training

I will train the model and save its weights after each epoch, then try to generate jokes with each version of the weights to see which performs best.
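The save-after-each-epoch workflow can be sketched as follows. The tiny `Linear` model, the temporary folder, and the file names are stand-ins chosen for illustration, not the notebook's actual model or paths.

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 2)  # toy stand-in for the fine-tuned model
checkpoint_dir = tempfile.mkdtemp()  # hypothetical checkpoint folder

checkpoint_paths = []
for epoch in range(2):
    # ... one epoch of training would run here ...
    path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")
    torch.save(model.state_dict(), path)  # store only the weights
    checkpoint_paths.append(path)

# Later: rebuild a model with the same architecture and load a chosen checkpoint
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(checkpoint_paths[-1], map_location="cpu"))
restored.eval()  # switch to inference mode before generating
```

Keeping one file per epoch makes it possible to compare the generations of each checkpoint afterwards and keep the best one.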

In [None]:
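model = model.to(device)
model.train()

# Each optimizer step accumulates gradients over BATCH_SIZE dataloader batches
total_steps = (len(dataloader) * EPOCHS) // BATCH_SIZE

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
# Assumption: cap the warmup at 10% of the schedule, since WARMUP_STEPS can exceed total_steps here
scheduler = WarmupLinearSchedule(
    optimizer=optimizer,
    num_warmup_steps=min(WARMUP_STEPS, total_steps // 10),
    num_training_steps=total_steps,
)

proc_seq_count = 0
sum_loss = 0.0
batch_count = 0

models_folder = "trained_models"
os.makedirs(models_folder, exist_ok=True)

for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)

    for batch in dataloader:
        # The dataset is already in torch format, so the batch holds tensors
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Exclude padding positions from the loss (label -100 is ignored)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        # Forward pass through the model
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()

        sum_loss += loss.item()
        proc_seq_count += 1

        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        if batch_count == 100:
            print(f"Sum loss: {sum_loss}")
            batch_count = 0
            sum_loss = 0.0

    # Save the model after each epoch
    torch.save(model.state_dict(), os.path.join(models_folder, f"gpt2_medium_sciq_{epoch}.pt"))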



In [None]:
# prompt = f'<=question=>{}<key>{}<support>{}<distractor>:{distractor1}'
# prompt = f'<=question=>{}<key>{}<support>{}<distractor>:{distractor2}'
# prompt = f'<=question=>{}<key>{}<support>{}<distractor>:{distractor3}'
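# Example question/answer pair, the same one used in the generation cell at the end of the notebook
question = "What is the largest planet in the solar system?"
correct_answer = "Jupiter"
prompt = f"QUESTION: {question}\nANSWER: {correct_answer}\nDISTRACTORS:"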

### Generating the jokes

In [None]:
# MODEL_EPOCH = 4

# models_folder = "trained_models"

# model_path = os.path.join(models_folder, f"gpt2_medium_joker_{MODEL_EPOCH}.pt")
# model.load_state_dict(torch.load(model_path))

# jokes_output_file_path = f'generated_{MODEL_EPOCH}.jokes'

# model.eval()
# if os.path.exists(jokes_output_file_path):
#     os.remove(jokes_output_file_path)

# joke_num = 0
# with torch.no_grad():

#         for joke_idx in range(1000):

#             joke_finished = False

#             cur_ids = torch.tensor(tokenizer.encode("JOKE:")).unsqueeze(0).to(device)

#             for i in range(100):
#                 outputs = model(cur_ids, labels=cur_ids)
#                 loss, logits = outputs[:2]
#                 softmax_logits = torch.softmax(logits[0,-1], dim=0) #Take the first(from only one in this case) batch and the last predicted embedding
#                 if i < 3:
#                     n = 20
#                 else:
#                     n = 3
#                 next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n) #Randomly(from the topN probability distribution) select the next word
#                 cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1) # Add the last word to the running sequence

#                 if next_token_id in tokenizer.encode('<|endoftext|>'):
#                     joke_finished = True
#                     break


#             if joke_finished:

#                 joke_num = joke_num + 1

#                 output_list = list(cur_ids.squeeze().to('cpu').numpy())
#                 output_text = tokenizer.decode(output_list)

#                 with open(jokes_output_file_path, 'a') as f:
#                     f.write(f"{output_text} \n\n")



The 3rd-epoch model seemed to perform the best.

The generated jokes output was too long for a notebook, so I stored it in [this file](https://github.com/mf1024/transformers/blob/master/generated_2_jokes.txt).

In [None]:
# from transformers import GPT2LMHeadModel, GPT2Tokenizer
# import torch

# # Load the tokenizer, with the same padding token that was added before fine-tuning
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
# tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

# # Load the fine-tuned model
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model_path = "trained_models/gpt2_medium_sciq_2.pt"  # Replace with the path to your model
# model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
# model.resize_token_embeddings(len(tokenizer))  # Match the enlarged vocabulary stored in the checkpoint
# model.load_state_dict(torch.load(model_path, map_location=device))
# model = model.to(device)
# model.eval()

# # Example question
# question = "What is the largest planet in the solar system?"
# correct_answer = "Jupiter"

# # Build the prompt
# prompt = f"QUESTION: {question}\nANSWER: {correct_answer}\nDISTRACTORS:"

# # Tokenize the input
# input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# # Generate text
# outputs = model.generate(
#     input_ids,
#     max_length=100,  # Maximum length of the generated text
#     num_return_sequences=1,  # Number of generated variants
#     do_sample=True,  # Enable sampling so that temperature, top_k and top_p take effect
#     temperature=0.7,  # Degree of diversity
#     top_k=50,  # Restrict sampling to the 50 most probable tokens
#     top_p=0.9,  # Sampling based on cumulative probability
#     eos_token_id=tokenizer.eos_token_id,  # End-of-text token id
#     pad_token_id=tokenizer.pad_token_id,  # Padding token id
# )

# # Decode and post-process the result
# generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# # Extract the distractors from the text
# distractors = generated_text.split("DISTRACTORS:")[-1].strip().split(" | ")
# print(f"Generated Distractors: {distractors}")
