# Fine-tuning Transformers for Text Generation

In this notebook we'll fine-tune [GPT-2](https://huggingface.co/gpt-2), a well-known Transformer model, to generate text in the style of Shakespeare. Concretely, we'll use the smaller, distilled `distilgpt2` checkpoint. By the end of this notebook you should know how to:

* Load and process a dataset from the Hugging Face Hub
* Fine-tune and evaluate a pretrained model on your data
* Push a model to the Hugging Face Hub

Let's get started!

## Setup

Whether you're running this notebook on Google Colab or locally, you'll need a few dependencies installed. Uncomment and run the following cell to install them with `pip`:

In [None]:
#! pip install datasets transformers sentencepiece

To share your model with the community, there are a few more steps to follow.

First, log in to your Hugging Face account (sign up [here](https://huggingface.co/join) if you haven't already!) so that your authentication token is stored locally. Execute the following cell and enter your credentials when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS. Uncomment and execute the following cell:

In [None]:
# !apt install git-lfs

## The dataset

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("tiny_shakespeare")

Using custom data configuration default
Found cached dataset tiny_shakespeare (/home/lewis/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)



In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

In [3]:
model_checkpoint = "distilgpt2"

In [4]:
from transformers import AutoTokenizer

# Load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

# Tokenize the raw text; the "text" column is dropped so that only the tokenizer outputs remain
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])

Loading cached processed dataset at /home/lewis/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-6b61ee6f8f0b8649.arrow
Loading cached processed dataset at /home/lewis/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-bfe9c31bad1c5043.arrow



Token indices sequence length is longer than the specified maximum sequence length for this model (17995 > 1024). Running this sequence through the model will result in indexing errors


In [6]:
# Length of each training chunk in tokens (well below the model's maximum input length of 1024)
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# split total dataset into smaller sets of length block_size
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True
)

Loading cached processed dataset at /home/lewis/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-895352a53b66d759.arrow
Loading cached processed dataset at /home/lewis/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-d0298be5220e086a.arrow
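
To make the chunking step concrete, here is a minimal, self-contained sketch that applies the same concatenate-and-split logic to toy token ids. The helper name `chunk_token_ids`, the toy ids and the block size of 4 are illustrative assumptions; no tokenizer or dataset is involved.

```python
def chunk_token_ids(batch, block_size):
    """Concatenate lists of token ids and split them into fixed-size blocks."""
    # Flatten every column (e.g. "input_ids") into one long list.
    concatenated = {k: [tok for seq in seqs for tok in seq] for k, seqs in batch.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [ids[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, ids in concatenated.items()
    }
    # For causal language modelling the labels start out as a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


# Hypothetical toy batch: two "documents" of five token ids each.
toy_batch = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]}
chunks = chunk_token_ids(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The last two ids (9 and 10) are dropped because they do not fill a whole block. The labels can be a plain copy of the inputs because causal language models in `transformers`, such as GPT-2, shift the labels by one position internally when computing the loss.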



In [9]:
print(tokenizer.decode(lm_datasets["train"][1]["input_ids"]))

 done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our p


In [10]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [11]:
from transformers import Trainer, TrainingArguments

In [12]:
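model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-shakespeare",  # output directory, also used as the Hub repository name
    evaluation_strategy="epoch",  # evaluate after every epoch (newer transformers releases rename this to `eval_strategy`)
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    push_to_hub=True,  # relies on the login and Git-LFS steps from the Setup section
)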

In [13]:
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

/home/lewis/git/workshops/luzern-university/distilgpt2-finetuned-shakespeare is already a clone of https://huggingface.co/lewtun/distilgpt2-finetuned-shakespeare. Make sure you pull the latest changes with `repo.git_pull()`.


In [15]:
trainer.train()

***** Running training *****
  Num examples = 2359
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1480


Epoch,Training Loss,Validation Loss
1,No log,3.832926
2,No log,3.727563
3,No log,3.663753
4,3.948400,3.633251
5,3.948400,3.610242
6,3.948400,3.593349
7,3.695200,3.584735
8,3.695200,3.579928
9,3.695200,3.575851
10,3.695200,3.574579


***** Running Evaluation *****
  Num examples = 141
  Batch size = 16
Saving model checkpoint to distilgpt2-finetuned-shakespeare/checkpoint-500
Configuration saved in distilgpt2-finetuned-shakespeare/checkpoint-500/config.json
Model weights saved in distilgpt2-finetuned-shakespeare/checkpoint-500/pytorch_model.bin
Saving model checkpoint to distilgpt2-finetuned-shakespeare/checkpoint-1000
Configuration saved in distilgpt2-finetuned-shakespeare/checkpoint-1000/config.json
Model weights saved in distilgpt2-finetuned-shakespeare/checkpoint-1000/pytorch_model.bin

TrainOutput(global_step=1480, training_loss=3.756452735694679, metrics={'train_runtime': 224.3666, 'train_samples_per_second': 105.14, 'train_steps_per_second': 6.596, 'total_flos': 770498793308160.0, 'train_loss': 3.756452735694679, 'epoch': 10.0})
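
The numbers in the training log can be sanity-checked with a little arithmetic. The sketch below recomputes the number of optimization steps from the logged example count, total batch size and epoch count, and converts the final validation loss into a perplexity. The helper names are illustrative and not part of `transformers`; the step formula assumes one gradient accumulation step (as logged), and the perplexity conversion assumes the reported loss is the mean per-token cross-entropy in nats.

```python
import math


def total_optimization_steps(num_examples, total_batch_size, num_epochs):
    """Steps = batches per epoch (last partial batch included) times epochs."""
    batches_per_epoch = math.ceil(num_examples / total_batch_size)
    return batches_per_epoch * num_epochs


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


# Values copied from the training log above.
steps = total_optimization_steps(num_examples=2359, total_batch_size=16, num_epochs=10)
print(steps)  # 1480, matching "Total optimization steps"

final_val_loss = 3.574579  # validation loss after epoch 10
print(f"{perplexity(final_val_loss):.1f}")  # about 35.7
```

A lower perplexity means the model is, on average, less "surprised" by the held-out Shakespeare text, so the steady drop in validation loss across epochs corresponds to a steady drop in perplexity.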

In [None]:
from transformers import pipeline

In [None]:
# device=0 runs generation on the first GPU; pass device=-1 to run on the CPU instead
pipe = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer, device=0)

In [None]:
outputs = pipe("HAMLET: Behold, I am ")[0]

In [None]:
print(outputs["generated_text"])

## Push to hub