# Fine-tuning GPT-2 on a custom dataset
In this notebook, we will train the [instruction-tuned GPT-2](./instruction-tunning-gpt2-alpaca.ipynb) model on a custom dataset. The [custom dataset](../dataset/captions/training-captions.csv) is a collection of Instagram captions that match the format we want to generate.

The goal is for the model to learn to generate captions similar to the ones in the dataset.

In [None]:
# Path to the saved model
model_name = "../models/gpt2_alpaca_preprocess_fn/best_model"
out_dir = "../models/gpt2_alpaca_preprocess_fn_custom"

In [1]:
# Install dependencies
%pip install accelerate transformers datasets trl

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    logging,
)
from trl import SFTTrainer

## Loading the custom dataset 

We'll start with a very small dataset with only 104 examples. The dataset is a collection of Instagram captions that match the format that we want to generate.

In [5]:
path = '../dataset/captions/training-captions.csv'
dataset = load_dataset('csv', data_files=path)['train']
dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 104
})

In [7]:
# print one sample
dataset[0]


{'instruction': 'Generate a caption for a photo of Lola.',
 'input': 'a small dog laying on a blue blanket',
 'output': 'Lola relaxing on her blue blanket.',
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Generate a caption for a photo of Lola. ### Input: a small dog laying on a blue blanket ### Response: Lola relaxing on her blue blanket.'}

In [10]:
# Split the dataset into training and validation sets
full_dataset = dataset.train_test_split(test_size=0.10, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']
 
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 93
})
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 11
})
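The 93/11 split above follows from how the fractional `test_size` is rounded. Here is a minimal sketch of that arithmetic, assuming the test count is rounded up and the train count is rounded down (an assumption that is consistent with the sizes printed above). `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size."""
    # Assumption: the test share is rounded up and the train share rounded down.
    n_test = math.ceil(test_size * num_rows)
    n_train = math.floor((1 - test_size) * num_rows)
    return n_train, n_test


train_rows, test_rows = split_sizes(104, 0.10)
print(train_rows, test_rows)
```

Because `shuffle=True` is used without a seed, the rows that land in each split change between runs. Passing a `seed` argument to `train_test_split` makes the split reproducible.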


## Preprocessing the dataset for training the GPT-2 model

We will use a preprocessing function to prepare the dataset for training. The function concatenates the `instruction`, `input`, and `output` columns into a single string, with the headers `### Instruction:`, `### Input:`, and `### Response:` preceding the corresponding values, and returns that string.

In [11]:
def preprocess_function(example):
    """
    This function formats the dictionary values into a single string with specific section headers and returns this string.
    The returned string is structured as follows:
    - Starts with "### Instruction:" followed by the instruction value from the dictionary.
    - Then "### Input:" followed by the input value from the dictionary.
    - Finally "### Response:" followed by the output value from the dictionary.
    Each section is separated by two newline characters for clear demarcation.
    """
    text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return text

In [12]:
# Example of the preprocess_function
preprocess_function(dataset_train[0])

'### Instruction:\nGenerate a caption for a photo of Chiki and Lola.\n\n### Input:\ntwo dogs enjoying the view from a hilltop\n\n### Response:\nChiki and Lola taking in the beautiful view from the hilltop. Nature at its finest.'
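At inference time the prompt should follow the same layout, with the response left empty so the model can complete it. The sketch below restates the formatting logic and adds `build_prompt`, a hypothetical helper that is not part of the notebook, to show that a training string is exactly the prompt followed by the expected caption.

```python
def format_example(example: dict) -> str:
    """Same layout as preprocess_function: three headed sections separated by blank lines."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )


def build_prompt(instruction: str, input_text: str) -> str:
    """Hypothetical helper: the training layout with the response left empty."""
    return format_example(
        {"instruction": instruction, "input": input_text, "output": ""}
    )


example = {
    "instruction": "Generate a caption for a photo of Lola.",
    "input": "a small dog laying on a blue blanket",
    "output": "Lola relaxing on her blue blanket.",
}
prompt = build_prompt(example["instruction"], example["input"])

# The full training string is the prompt followed by the expected caption.
print(format_example(example) == prompt + example["output"])
print(repr(prompt))
```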

In [17]:
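# `bf16` is not defined in the cells shown; it is set here to match
# `bf16=False` in the training arguments used later in this notebook.
bf16 = False

if bf16:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(dtype=torch.bfloat16)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)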

# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")

total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

124,439,808 total parameters.
124,439,808 training parameters.
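The total above can be reproduced by hand. The sketch below assumes the standard GPT-2 small configuration (vocabulary of 50,257 tokens, 1,024 positions, hidden size 768, 12 blocks); these values are assumptions and are not read from the checkpoint. It also assumes the output head shares its weights with the token embedding, so that matrix is counted once.

```python
# Assumed GPT-2 small configuration (not read from the saved checkpoint).
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * n_embd  # scale and shift vectors

per_block = (
    layer_norm                        # norm before attention
    + linear(n_embd, 3 * n_embd)      # fused query/key/value projection
    + linear(n_embd, n_embd)          # attention output projection
    + layer_norm                      # norm before the MLP
    + linear(n_embd, 4 * n_embd)      # MLP expansion
    + linear(4 * n_embd, n_embd)      # MLP contraction
)

embeddings = vocab_size * n_embd + n_positions * n_embd  # token + position tables
total = embeddings + n_layer * per_block + layer_norm    # plus the final layer norm

print(f"{total:,} total parameters.")
```

Every parameter has `requires_grad=True`, which is why the trainable count equals the total: this is full fine-tuning, with no frozen layers.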


## Initialize Tokenizer and Set Padding Token
Next, we initialize the tokenizer from the GPT-2 checkpoint that we previously fine-tuned on the Alpaca dataset. The `trust_remote_code=True` argument allows loading tokenizers that ship custom (user-provided) code. We set `use_fast=False` to use the Python-based implementation of the tokenizer. After initializing the tokenizer, we set its padding token to the end-of-sequence (EOS) token, because GPT-2 does not define a padding token and one is needed to batch sequences of different lengths.

In [18]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token
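To illustrate why a padding token is needed, here is a minimal pure-Python sketch of right-padding a batch. It is not the tokenizer's implementation; `pad_batch` is a hypothetical helper and the token ids in the example are toy values. The id 50256 is `<|endoftext|>` in the standard GPT-2 vocabulary.

```python
EOS_TOKEN_ID = 50256  # <|endoftext|> in the standard GPT-2 vocabulary


def pad_batch(sequences, pad_token_id=EOS_TOKEN_ID):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[10, 11, 12], [20]])  # toy ids, not real encodings
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since padding and EOS share the same id, the attention mask is what distinguishes a padded position from a genuine end-of-text token.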

## Set Training Arguments
-   `output_dir=f"{out_dir}/logs"`: Sets the output directory where the model predictions and checkpoints will be written. 

-   `evaluation_strategy='steps'`: Evaluates the model at regular step intervals. Because `eval_steps` is not set, it falls back to the value of `logging_steps` (100).

-   `weight_decay=0.01`: Applies a weight decay of 0.01 to help prevent the model from overfitting.

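-   `load_best_model_at_end=True`: Loads the best checkpoint (by default, the one with the lowest evaluation loss) at the end of training. Only saved checkpoints can be restored, so with `save_steps=1000` and `max_steps=1000` the only candidate is the final checkpoint.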

-   `per_device_train_batch_size=4`: Sets the batch size per device during training to 4.

-   `per_device_eval_batch_size=4`: Sets the batch size for evaluation to 4.

-   `logging_strategy='steps'`: Sets the logging strategy during training to `'steps'`, meaning logging will occur at regular step intervals.

-   `save_strategy='steps'`: Sets the strategy for saving checkpoints to `'steps'`, meaning checkpoints will be saved at regular step intervals.

-   `logging_steps=100`: Sets the number of steps between each logging to 100.

-   `save_steps=1000`: Sets the number of steps between each checkpoint save to 1000.

-   `save_total_limit=2`: Limits the number of checkpoints kept on disk to 2. Older checkpoints are deleted once this limit is exceeded.

-   `bf16=False`: Sets whether to use bf16 precision for training to `False`. 

-   `fp16=False`: Sets whether to use fp16 precision for training to `False`.

-   `fp16_full_eval=False`: Sets whether to use fp16 precision for evaluation to `False`.

-   `report_to='tensorboard'`: Sets where to report the results to `'tensorboard'`.

-   `max_steps=1000`: Sets the total number of training steps to 1000.

-   `dataloader_num_workers=os.cpu_count()`: Sets the number of worker subprocesses used for data loading to the number of logical CPUs reported by the system.

-   `gradient_accumulation_steps=1`: Sets the number of steps to accumulate gradients before performing an optimizer step to 1.

-   `learning_rate=0.00003`: Sets the initial learning rate for the optimizer to 0.00003.

-   `lr_scheduler_type='constant'`: Sets the type of learning rate scheduler to use to `'constant'`, meaning the learning rate will remain constant throughout training.

In [19]:
training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    evaluation_strategy='steps',
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_strategy='steps',
    save_strategy='steps',
    logging_steps=100,
    save_steps=1000,
    save_total_limit=2,
    bf16=False,
    fp16=False,
    fp16_full_eval=False,
    report_to='tensorboard',
    max_steps=1000,
    dataloader_num_workers=os.cpu_count(),
    gradient_accumulation_steps=1,
    learning_rate=0.00003,
    lr_scheduler_type='constant',
)
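With `max_steps` fixed, the number of passes over the data depends on how many training sequences there are and on the effective batch size. The sketch below works through that arithmetic. The count of 20 packed sequences is a hypothetical value chosen for illustration (the notebook does not state it), and `steps_per_epoch` is a hypothetical helper, not the `Trainer`'s own computation.

```python
import math


def steps_per_epoch(num_sequences, per_device_batch_size,
                    grad_accum_steps=1, num_devices=1):
    """Optimizer steps needed to see every sequence once."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_sequences / effective_batch)


num_packed_sequences = 20  # hypothetical; not stated in the notebook
spe = steps_per_epoch(num_packed_sequences, per_device_batch_size=4)
epochs = 1000 / spe        # max_steps=1000

print(f"{spe} steps per epoch, {epochs:.0f} epochs")
```

Under this assumption the run covers 200 epochs, which is a very large number of passes for a dataset of about a hundred captions.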

In [20]:
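# NOTE: `context_length` (the length of each packed sequence) is not defined
# in the cells shown; it must be set before this cell runs. GPT-2 accepts at
# most 1024 tokens per sequence.
trainer = SFTTrainer(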
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_valid,
    max_seq_length=context_length,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=preprocess_function,
    packing=True
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [21]:
dataloader = trainer.get_train_dataloader()
for i, sample in enumerate(dataloader):
    print(tokenizer.decode(sample['input_ids'][0]))
    print('#'*50)
    if i == 5:
        break

 nature's beauty together.<|endoftext|>### Instruction:
Generate a caption for a photo of Chiki.

### Input:
a dog sitting by a campfire

### Response:
Chiki enjoying the warmth by the campfire.<|endoftext|>### Instruction:
Generate a caption for a photo of Chiki.

### Input:
a small dog looking at a tree

### Response:
Chiki always curious about nature.<|endoftext|>### Instruction:
Generate a caption for a photo of Chiki and Lola.

### Input:
two small dogs on a couch

### Response:
Chiki and Lola snuggling on the couch. #CouchBuddies #ChillTime<|endoftext|>### Instruction:
Generate a caption for a photo of Chiki and Lola.

### Input:
two chihuahuas playing by the lake

### Response:
Playtime by the lake for Chiki and Lola. Joy and laughter in every moment.<|endoftext|>### Instruction:
Generate a caption for a photo of Chiki and Lola.

### Input:
two small dogs running on the beach

### Response:
Chiki and Lola enjoying their beach
##################################################
 d
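The decoded sample above begins and ends in the middle of a caption because `packing=True` concatenates the formatted examples, separated by the EOS token, and cuts the stream into fixed-length blocks. Here is a minimal sketch of that idea, using toy token ids. It is not TRL's implementation, which may differ in details such as shuffling and buffering, and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples with an EOS separator and cut into equal blocks."""
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # marks the boundary between examples
    n_blocks = len(buffer) // seq_length  # a trailing partial block is dropped
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token ids
blocks = pack_sequences(examples, seq_length=4, eos_token_id=0)
print(blocks)
```

Packing avoids padding entirely, so every position in a block contributes to the loss, at the cost of blocks that straddle example boundaries.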

In [22]:
# Start training
history = trainer.train()

  0%|          | 0/1000 [00:00<?, ?it/s]

{'loss': 0.5753, 'grad_norm': 2.101097822189331, 'learning_rate': 3e-05, 'epoch': 20.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.8485015630722046, 'eval_runtime': 45.543, 'eval_samples_per_second': 0.044, 'eval_steps_per_second': 0.022, 'epoch': 20.0}
{'loss': 0.1378, 'grad_norm': 1.809464454650879, 'learning_rate': 3e-05, 'epoch': 40.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.0810812711715698, 'eval_runtime': 46.348, 'eval_samples_per_second': 0.043, 'eval_steps_per_second': 0.022, 'epoch': 40.0}
{'loss': 0.0701, 'grad_norm': 1.174267292022705, 'learning_rate': 3e-05, 'epoch': 60.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.220778226852417, 'eval_runtime': 43.8007, 'eval_samples_per_second': 0.046, 'eval_steps_per_second': 0.023, 'epoch': 60.0}
{'loss': 0.0377, 'grad_norm': 1.450278639793396, 'learning_rate': 3e-05, 'epoch': 80.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.3182388544082642, 'eval_runtime': 43.9591, 'eval_samples_per_second': 0.045, 'eval_steps_per_second': 0.023, 'epoch': 80.0}
{'loss': 0.0225, 'grad_norm': 0.9912919998168945, 'learning_rate': 3e-05, 'epoch': 100.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.3886469602584839, 'eval_runtime': 43.0452, 'eval_samples_per_second': 0.046, 'eval_steps_per_second': 0.023, 'epoch': 100.0}
{'loss': 0.0146, 'grad_norm': 0.22138874232769012, 'learning_rate': 3e-05, 'epoch': 120.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.4245778322219849, 'eval_runtime': 42.8491, 'eval_samples_per_second': 0.047, 'eval_steps_per_second': 0.023, 'epoch': 120.0}
{'loss': 0.0121, 'grad_norm': 0.6677820086479187, 'learning_rate': 3e-05, 'epoch': 140.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.502774715423584, 'eval_runtime': 43.5368, 'eval_samples_per_second': 0.046, 'eval_steps_per_second': 0.023, 'epoch': 140.0}
{'loss': 0.0097, 'grad_norm': 0.27721184492111206, 'learning_rate': 3e-05, 'epoch': 160.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.5635876655578613, 'eval_runtime': 43.1566, 'eval_samples_per_second': 0.046, 'eval_steps_per_second': 0.023, 'epoch': 160.0}
{'loss': 0.0075, 'grad_norm': 0.3528817892074585, 'learning_rate': 3e-05, 'epoch': 180.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.5918594598770142, 'eval_runtime': 43.8265, 'eval_samples_per_second': 0.046, 'eval_steps_per_second': 0.023, 'epoch': 180.0}
{'loss': 0.0076, 'grad_norm': 0.0635656863451004, 'learning_rate': 3e-05, 'epoch': 200.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 1.624525547027588, 'eval_runtime': 42.6813, 'eval_samples_per_second': 0.047, 'eval_steps_per_second': 0.023, 'epoch': 200.0}


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


{'train_runtime': 13900.6095, 'train_samples_per_second': 0.288, 'train_steps_per_second': 0.072, 'train_loss': 0.08949895161390305, 'epoch': 200.0}
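The log above shows the training loss falling steadily while the evaluation loss rises from the first evaluation onward, which suggests overfitting. The sketch below makes that pattern explicit. The values are copied from the log and rounded to four decimals, and the step numbers assume one log entry every 100 steps, as set by `logging_steps`.

```python
# (step, train_loss, eval_loss) copied from the training log, rounded to 4 decimals.
log = [
    (100, 0.5753, 0.8485), (200, 0.1378, 1.0811), (300, 0.0701, 1.2208),
    (400, 0.0377, 1.3182), (500, 0.0225, 1.3886), (600, 0.0146, 1.4246),
    (700, 0.0121, 1.5028), (800, 0.0097, 1.5636), (900, 0.0075, 1.5919),
    (1000, 0.0076, 1.6245),
]

# The step with the lowest evaluation loss is the best stopping point by this metric.
best_step, _, best_eval = min(log, key=lambda row: row[2])

# A growing gap between evaluation and training loss is a common sign of overfitting.
first_gap = log[0][2] - log[0][1]
last_gap = log[-1][2] - log[-1][1]

print(f"lowest eval_loss {best_eval} at step {best_step}")
print(f"eval/train gap: {first_gap:.4f} at step 100, {last_gap:.4f} at step 1000")
```

By this measure the best model is the one at step 100. One option worth considering is a much smaller `max_steps`, or a smaller `save_steps` so that earlier checkpoints exist for `load_best_model_at_end` to choose from.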


In [23]:
import pandas as pd

history_dict = history._asdict()
history_df = pd.DataFrame(history_dict)
print(history_df)

                          global_step  training_loss       metrics
train_runtime                    1000       0.089499  1.390061e+04
train_samples_per_second         1000       0.089499  2.880000e-01
train_steps_per_second           1000       0.089499  7.200000e-02
total_flos                       1000       0.089499  4.703257e+14
train_loss                       1000       0.089499  8.949895e-02
epoch                            1000       0.089499  2.000000e+02


In [27]:
# Save the model and tokenizer
model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")

('../models/gpt2_alpaca_preprocess_fn_custom/best_model/tokenizer_config.json',
 '../models/gpt2_alpaca_preprocess_fn_custom/best_model/special_tokens_map.json',
 '../models/gpt2_alpaca_preprocess_fn_custom/best_model/vocab.json',
 '../models/gpt2_alpaca_preprocess_fn_custom/best_model/merges.txt',
 '../models/gpt2_alpaca_preprocess_fn_custom/best_model/added_tokens.json')

## Testing the model tuned on the custom dataset

In [29]:
from transformers import (
    AutoModelForCausalLM, 
    logging, 
    pipeline,
    AutoTokenizer
)
import torch

In [30]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(f'{out_dir}/best_model/')
tokenizer = AutoTokenizer.from_pretrained(f'{out_dir}/best_model/')
tokenizer.pad_token = tokenizer.eos_token

In [31]:
# Create a pipeline to generate text
pipe = pipeline(
    task='text-generation', 
    model=model, 
    tokenizer=tokenizer, 
    max_length=256, # Prompt + new tokens to generate.
    device=device  # move the already-loaded model to the selected device
)

In [34]:
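# Define the prompt using the section headers the model was trained on.
# Note: the training examples separate sections with a blank line,
# whereas this template uses single newlines.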
template = """### Instruction:
{}
### Input:
{}
### Response:
{}"""

# Write the prompt
instructions = 'Generate a caption for a photo of Chiki'
inputs = 'a dog laying on a blanket'
response = ''
prompt = template.format(instructions, inputs, response)

# Generate text
outputs = pipe(
    prompt, 
    do_sample=True, 
    temperature=0.7, 
    top_k=50, 
    top_p=0.95,
    repetition_penalty=1.1,
)
print(outputs[0]['generated_text'])
outputs

### Instruction:
Generate a caption for a photo of Chiki
### Input:
a dog laying on a blanket
### Response:
Chiki enjoying a cozy moment on the blanket.


[{'generated_text': '### Instruction:\nGenerate a caption for a photo of Chiki\n### Input:\na dog laying on a blanket\n### Response:\nChiki enjoying a cozy moment on the blanket.'}]
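The pipeline returns the prompt together with the completion, so the caption still has to be separated from the prompt. Here is a minimal sketch of one way to do that by splitting on the response header. `extract_response` is a hypothetical helper, and the string is the output shown above.

```python
RESPONSE_HEADER = "### Response:"


def extract_response(generated_text: str) -> str:
    """Return the text after the first response header, or '' if there is none."""
    _, _, response = generated_text.partition(RESPONSE_HEADER)
    return response.strip()


generated_text = (
    "### Instruction:\nGenerate a caption for a photo of Chiki\n"
    "### Input:\na dog laying on a blanket\n"
    "### Response:\nChiki enjoying a cozy moment on the blanket."
)
caption = extract_response(generated_text)
print(caption)
```

As an alternative, the text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text.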