# Finetune GPT-2 to talk like Shakespeare

Here you will finetune (that is, continue training) GPT-2 on a corpus of Shakespeare. GPT-2 was pretrained on contemporary English, so finetuning nudges it toward Shakespeare's vocabulary and style. Afterwards you can prompt the model and see how it replies.

References:
* https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface
* https://huggingface.co/docs/transformers/en/model_doc/gpt2

In [4]:
! pip install --user transformers[torch] torch torchvision torchaudio # tf-keras

Collecting accelerate>=0.26.0 (from transformers[torch])
  Downloading accelerate-1.5.2-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.5.2-py3-none-any.whl (345 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.5.2


Now restart your kernel with **Kernel > Restart Kernel**. Test the installation by running:

In [None]:
import transformers

# Load Shakespeare data for training

In [1]:
# First load a tokenizer, which specifies the subword tokenization for an LLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [4]:
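# Build the training and test datasets from plain-text files, split into blocks of 64 tokens.
# Note: TextDataset is deprecated in recent transformers releases in favor of the datasets library.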
from transformers import TextDataset, DataCollatorForLanguageModeling

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='data/shakespeare_train.txt',
    block_size=64)

test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='data/shakespeare_test.txt',
    block_size=64)

# mlm=False selects causal (next-token) language modeling rather than masked language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False)

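
To make `block_size=64` concrete, here is a minimal sketch of the chunking step. It assumes `TextDataset` tokenizes the whole file into one flat list of ids and slices it into consecutive, non-overlapping blocks, dropping any leftover tokens at the end. The function name `chunk_into_blocks` and the dummy ids are illustrative and not part of the library.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive, non-overlapping blocks.

    Tokens left over after the last full block are dropped (an assumption
    about how the dataset handles the remainder).
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Dummy ids standing in for a tokenized corpus of 200 tokens
token_ids = list(range(200))
blocks = chunk_into_blocks(token_ids, block_size=64)

print(len(blocks))                  # number of full blocks
print([len(b) for b in blocks])     # every block has exactly block_size ids
```

Each block becomes one training example, so a larger `block_size` gives the model more context per example but yields fewer examples from the same text.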

# Initialize training settings

In [7]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")


training_args = TrainingArguments(
    output_dir="./gpt2-shakespeare",  # where checkpoints and the final model are written
    overwrite_output_dir=True,        # overwrite existing contents of output_dir
    num_train_epochs=3,               # number of passes over the training data
    per_device_train_batch_size=32,   # training batch size per device
    per_device_eval_batch_size=64,    # evaluation batch size per device
    eval_steps=300,                   # update steps between evaluations (only takes effect when the evaluation strategy is "steps")
    save_steps=800,                   # save a checkpoint every 800 update steps
    warmup_steps=500,                 # warmup steps for the learning-rate scheduler
    prediction_loss_only=True,        # during evaluation, compute only the loss
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
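
The `warmup_steps=500` setting interacts with the learning-rate schedule. The sketch below assumes the `Trainer` defaults (a base learning rate of 5e-5 with a linear schedule) and a hypothetical total of 1641 update steps. `linear_warmup_lr` is an illustrative helper, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=1641):
    """Learning rate at a given update step: linear ramp up, then linear decay to zero.

    base_lr and total_steps are assumptions made for illustration.
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1000, 1641):
    print(step, linear_warmup_lr(step))
```

If a run has only slightly more total steps than warmup steps, most updates happen at a reduced learning rate, so it may be worth scaling `warmup_steps` to the size of the dataset.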


# Finetune (train) the model
Then save the finetuned model to disk.

In [8]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 4.4992        |


TrainOutput(global_step=547, training_loss=4.477526091134527, metrics={'train_runtime': 151.4544, 'train_samples_per_second': 115.395, 'train_steps_per_second': 3.612, 'total_flos': 570825105408000.0, 'train_loss': 4.477526091134527, 'epoch': 1.0})
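
One way to read the reported loss is as a perplexity. The sketch below assumes `training_loss` is the mean per-token cross-entropy in nats, so exponentiating it gives a training perplexity. It also checks how many examples 547 steps at batch size 32 can cover, assuming a single device and no gradient accumulation. The variable names are illustrative.

```python
import math

# Values copied from the TrainOutput above
training_loss = 4.477526091134527
global_step = 547
batch_size = 32  # per_device_train_batch_size, assuming one device

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(training_loss)

# Upper bound on examples seen in the epoch (the last batch may be smaller)
max_examples = global_step * batch_size

print(f"training perplexity: {perplexity:.1f}")
print(f"examples per epoch: at most {max_examples}")
```

This is a training-set figure averaged over the run. A held-out perplexity would come from evaluating on the test set instead.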

In [9]:
trainer.save_model() # this will save to the directory specified in the TrainingArguments object

# Generate text from the finetuned model

In [11]:
from transformers import pipeline

shakespeare_gpt2 = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2')

Device set to use cuda:0


In [13]:
shakespeare_gpt2('Please')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Please.\nO, so you do, though you are all of you dead!\nI have a very large purse for my life; if the poor\nLet them rest their grief's sighs thereabouts,\nThe way I have made it"
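
The returned string contains `\n` escapes, so it reads more naturally when printed line by line. The sketch below wraps a text-generation callable and splits its output into lines. `generate_lines` and `fake_generator` are hypothetical names, and the stand-in generator simply returns the sample output above so the block runs without a trained model. With the real pipeline you would pass `shakespeare_gpt2` instead. The keyword arguments shown (`max_new_tokens`, `do_sample`, `temperature`) are generation options with illustrative values.

```python
def generate_lines(generator, prompt, **gen_kwargs):
    """Call a text-generation callable and return its first result as a list of lines."""
    result = generator(prompt, **gen_kwargs)
    return result[0]["generated_text"].splitlines()


def fake_generator(prompt, **gen_kwargs):
    """Stand-in for the pipeline: returns the sample output shown above."""
    text = ("Please.\nO, so you do, though you are all of you dead!\n"
            "I have a very large purse for my life; if the poor\n"
            "Let them rest their grief's sighs thereabouts,\n"
            "The way I have made it")
    return [{"generated_text": text}]


# Replace fake_generator with shakespeare_gpt2 to query the finetuned model
lines = generate_lines(fake_generator, "Please",
                       max_new_tokens=50, do_sample=True, temperature=0.8)
for line in lines:
    print(line)
```

Sampling with a temperature below 1 tends to give more conservative text, while higher values give more varied output. The values here are starting points, not tuned settings.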