# Generative Supervised Fine-tuning of GPT-2

Now that we have a trained GPT-2 model, we need a way to get it to generate what we want.

In the following notebook, we're going to use an approach called "Supervised Fine-tuning" to achieve our goals today.

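In essence, we're going to treat each example as a self-contained unit (optionally combining several short examples into one sequence, something called "packing"), and this is going to allow us to build "labeled" data: each prompt is paired with the response we want the model to produce.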
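
To make "packing" a little more concrete, here is a minimal sketch in plain Python. It is not how any particular library implements packing; the token ids, the `EOS_ID` value, and the `pack_examples` helper are all made up for illustration. The idea is simply to join tokenized examples into one stream, separated by an end-of-sequence id, and cut that stream into fixed-size blocks.

```python
from typing import List

EOS_ID = 0  # hypothetical end-of-sequence id, chosen only for this illustration


def pack_examples(tokenized: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples (each followed by EOS) and cut into fixed-size blocks."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(EOS_ID)  # separator so example boundaries stay visible
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_examples(examples, block_size=4)
print(packed)
```

Notice that the second block mixes the end of one example with the start of the next. That is the trade-off of packing: less padding, but example boundaries no longer line up with sequence boundaries.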

For this notebook, we're going to be flying quite high up in the levels of abstraction. Take extra care to look into the libraries we're using today!

Let's start by grabbing our dependencies, as always:



In [None]:
!pip install transformers accelerate datasets trl bitsandbytes -qU

## Dataset Curation

We're going to be fine-tuning our model on SQL generation today.

First thing we'll need is a dataset to train on!

We'll use the [`b-mc2/sql-create-context`](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset today!

First up, let's load it and take a look at what we've got.

In [None]:
from datasets import load_dataset

sql_dataset = load_dataset("b-mc2/sql-create-context")

In [None]:
sql_dataset

In [None]:
sql_dataset['train'][0]

So, we've got ~78.5K rows of:

- question - a natural language query about the table
- context - the `CREATE TABLE` statement - which gives us important context about the table
- answer - a SQL query that is aligned with both the question and the context.

Let's split our data into `train`, `val`, and `test` datasets.

We can use our `train` and `val` sets to train and evaluate our model during training - and our `test` set to ultimately benchmark the generations of our model!
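
As a rough sketch of the arithmetic behind this two-stage split (hold out 20%, then halve the held-out part), here is a plain-Python version that works on row indices. The `three_way_split` helper and the row count of 1,000 are hypothetical, and the exact rounding used by the `datasets` library may differ slightly.

```python
import random
from typing import List, Tuple


def three_way_split(n_rows: int, seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices, hold out 20%, then split the held-out part in half."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = n_rows * 20 // 100   # 20% of the rows are held out
    n_test = n_holdout // 2          # half of the held-out rows become the test set
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    test, val = holdout[:n_test], holdout[n_test:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The result is an 80/10/10 split where no row appears in more than one set. Fixing the seed keeps the split reproducible between runs.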

In [None]:
sql_dataset_train_test = sql_dataset["train"].train_test_split(test_size=0.2)

In [None]:
sql_dataset_train_test

In [None]:
sql_dataset_val_test = sql_dataset_train_test["test"].train_test_split(test_size=0.5)

In [None]:
sql_dataset_val_test

In [None]:
from datasets import DatasetDict

split_sql_dataset = DatasetDict({
    "train" : sql_dataset_train_test["train"],
    "val" : sql_dataset_val_test["train"],
    "test" : sql_dataset_val_test["test"]
})

In [None]:
split_sql_dataset

### Creating a "Prompt"

Now we need to create a prompt that's going to allow us to interact with our model when we want the trained behaviour.

Think of this as a pattern that aligns the model with our desired outputs.

We need a single text prompt, as that is what the `SFTTrainer` we're going to use to fine-tune our model expects.

The basic idea is that we're going to merge the `question`, `context`, and `answer` into a single block of text that shows the model our desired outputs.

Let's look at what that block needs to look like:

```
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
{response}{eos_token}
```

Let's look at that from a completed prompt perspective to get a bit more information:

```
<|startoftext|>### Instruction:
You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.

### Input:
How many locations did the team play at on week 7?

### Context:
CREATE TABLE table_24123547_2 (location VARCHAR, week VARCHAR)

### Response:
SELECT COUNT(location) FROM table_24123547_2 WHERE week = 7<|endoftext|>
```

As you can see, our prompt contains completed examples of our task. We're going to show our model many of these examples over and over again to teach it to produce outputs that are aligned with our goals!

First step, let's create a template we can use to call `.format()` on while constructing our prompts.

### 🏗️ Activity:

Create the following templates:

- `TEXT2SQL_TRAINING_PROMPT_TEMPLATE`
- `TEXT2SQL_INFERENCE_PROMPT_TEMPLATE`

> HINT: Remember that during inference we do not want to prepopulate the response.

In [None]:
TEXT2SQL_TRAINING_PROMPT_TEMPLATE = """\
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
{response}{eos_token}
"""

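# Same layout as the training template, but it stops right after the
# response header so the model writes the SQL itself.
TEXT2SQL_INFERENCE_PROMPT_TEMPLATE = """\
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
"""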

Now let's create a function we can map over our dataset to create the full prompt text block.

### 🏗️ Activity:

Define a `SYSTEM_MESSAGE` to use with your training data.

In [None]:
# Built from adjacent string literals so no stray newline or indentation ends up in the prompt.
SYSTEM_MESSAGE = (
    "You are a powerful text-to-SQL model. Your job is to answer questions about a database. "
    "You are given a question and context regarding one or more tables. "
    "You must output the SQL query that answers the question."
)


In [None]:
def create_sql_prompt(sample):
  full_prompt = TEXT2SQL_TRAINING_PROMPT_TEMPLATE.format(
      bos_token = "<|startoftext|>",
      eos_token = "<|endoftext|>",
      system_message = SYSTEM_MESSAGE,
      input = sample["question"],
      context = sample["context"],
      response = sample["answer"]
  )

  return {"text" : full_prompt}

I've created this helper function so we can look at our model's generations directly, rather than judging it only through metrics.

In [None]:
def create_sql_prompt_and_response(sample):
  full_prompt = TEXT2SQL_INFERENCE_PROMPT_TEMPLATE.format(
      bos_token = "<|startoftext|>",
      system_message = SYSTEM_MESSAGE,
      input = sample["question"],
      context = sample["context"]
  )

  ground_truth = sample["answer"]

  return {"full_prompt" : full_prompt, "ground_truth" : ground_truth}
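
When comparing a generation against `ground_truth`, it can help to isolate just the SQL from the generated text. Here is a minimal sketch of one way to do that, assuming the prompt layout shown earlier; the `extract_response` helper is hypothetical and not part of any library.

```python
RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "<|endoftext|>"


def extract_response(generated_text: str) -> str:
    """Return the text after the last response marker, trimmed at the EOS token."""
    # Keep only what follows the response header, if the header is present.
    _, found, tail = generated_text.rpartition(RESPONSE_MARKER)
    answer = tail if found else generated_text
    # Discard anything that appears after the end-of-sequence marker.
    answer = answer.split(EOS_TOKEN, 1)[0]
    return answer.strip()


sample_output = "### Input:\nHow many rows?\n\n### Response:\nSELECT COUNT(*) FROM t<|endoftext|>\nextra text"
print(extract_response(sample_output))
```

If the marker is missing, the helper falls back to the whole string, so it is safe to call on generations that drifted away from the template.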

Let's look at an example of a formatted prompt.

In [None]:
create_sql_prompt(split_sql_dataset["train"][0])

Great!

Now we can map this over our dataset!

In [None]:
split_sql_dataset = split_sql_dataset.map(create_sql_prompt)

## Load the Model And Preprocessing

Now for the moment we've all been waiting for...

Loading our model!

Let's use the `AutoModelForCausalLM` and `AutoTokenizer` classes from `transformers` to see just how easy this is.

- [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/v4.35.0/en/model_doc/auto#transformers.AutoModelForCausalLM)
- [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [GPT-2 Model Card](https://huggingface.co/gpt2)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"

gpt2_base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

gpt2_tokenizer = AutoTokenizer.from_pretrained(model_id)

We need to make sure our tokenizer has a `pad_token` in order to be able to pad sequences so they're all the same length.

We'll use a little trick here to set our padding token to our eos (end of sequence) token to make training go a little smoother.

In [None]:
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
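
To see what padding actually does, here is a small plain-Python sketch. The `pad_batch` helper is hypothetical (the tokenizer and data collator normally handle this for us), and the token ids are made up, apart from 50256, which is GPT-2's `<|endoftext|>` id. Since the pad token and the eos token now share an id, it is the attention mask, not the id itself, that marks which positions are padding.

```python
from typing import List, Tuple

PAD_ID = 50256  # GPT-2's <|endoftext|> id, reused as the pad id


def pad_batch(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding positions
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]], PAD_ID)
print(ids)
print(mask)
```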

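We also need to make sure the model's token embedding matrix matches the tokenizer's vocabulary size. Because we reused the existing eos token as our pad token, no new tokens were added, so this call changes nothing here. It becomes essential when you add new special tokens: without resizing, the new token ids would fall outside the embedding matrix and training would fail with an indexing error!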

In [None]:
gpt2_base_model.resize_token_embeddings(len(gpt2_tokenizer))

Now let's use the Hugging Face `pipeline` to see what generation looks like for our model before fine-tuning.

In [None]:
from transformers import pipeline, set_seed, GenerationConfig

# Wrap the model and tokenizer in a text-generation pipeline.
generator = pipeline('text-generation', model=gpt2_base_model, tokenizer=gpt2_tokenizer)
set_seed(42)  # make sampling reproducible

def generate_sample(sample):
  # Build the inference prompt and keep the reference answer for comparison.
  prompt_package = create_sql_prompt_and_response(sample)

  generation_config = GenerationConfig(
      max_new_tokens=50,
      do_sample=True,
      top_k=50,
      temperature=1e-4,  # near-zero temperature makes sampling almost greedy
      eos_token_id=gpt2_base_model.config.eos_token_id,
  )

  print("Input processed : ",prompt_package["full_prompt"])

  # `generator` is looked up at call time, so reassigning it later switches models.
  generation = generator(prompt_package["full_prompt"], generation_config=generation_config)
  print("---------------")
  print("Model Response:")
  # Strip the prompt so only the newly generated text is shown.
  print(generation[0]["generated_text"].replace(prompt_package["full_prompt"], ""))
  print("+++++++++++++++")
  print("Ground Truth")
  print(prompt_package["ground_truth"])

In [None]:
generate_sample(split_sql_dataset["test"][0])
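
A quick aside on that tiny `temperature` value: dividing the logits by a number close to zero before the softmax pushes almost all of the probability onto the single most likely token, so sampling behaves almost like greedy decoding. Here is a minimal plain-Python sketch of the effect; the logits are made up and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature (max-subtracted for stability)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
soft = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 1e-4)
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sharp])
```

At a temperature of 1.0 all three tokens keep a meaningful share of the probability, while at 1e-4 the top token takes essentially all of it.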

## Training the Model

Now that we have our model set up, our tokenizer set up, we can finally begin training!

We'll be using `SFTConfig` from `trl`, which extends the `TrainingArguments` class from Hugging Face's `transformers` library, to help us keep track of our hyper-parameters. More information and documentation is available here:

- [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)

Let's look at our Trainer, and set some hyper-parameters:

- `per_device_train_batch_size` - the number of examples per batch on each device; in distributed training, the total batch size is this value times the number of devices
- `gradient_accumulation_steps` - this is exactly the same as the previous notebook, it's a way to "simulate" a larger batch size by accumulating scaled gradients over multiple iterations before each optimizer step.
- `gradient_checkpointing` - I'll let the authors speak for themselves [here](https://github.com/cybertronai/gradient-checkpointing). In essence: This saves memory at the cost of computational time.
- `max_grad_norm` - this is the maximum gradient norm used for gradient clipping, which is a method of guarding against exploding gradients
- `max_steps` - how many steps will we train for?
- `learning_rate` - how fast should we learn?
- `save_total_limit` - how many versions of the model will we save?
- `logging_steps` - how often we should log
- `output_dir` - where to save our checkpoints
- `optim` - which optimizer to use, you'll notice we're using a full precision paged optimizer - this is a performant and stable optimizer - but it uses extra memory
- `lr_scheduler_type` - we are once again using a cosine scheduler!
- `eval_strategy` (called `evaluation_strategy` in older `transformers` releases) - we have an evaluation dataset, this defines when we should leverage it during training
- `eval_steps` - how many training steps pass between evaluations
- `warmup_ratio` - the fraction of `max_steps` we spend "warming up" to our full learning rate before we start decaying
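
To get a feel for how a few of these hyper-parameters interact, here is a small plain-Python sketch. The numbers mirror the values used in this notebook, but `lr_at` and `clip_by_global_norm` are simplified, hypothetical stand-ins for what the trainer does internally, not the library's actual code.

```python
import math
from typing import List

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 500
BASE_LR = 2e-4
WARMUP_RATIO = 0.05
MAX_GRAD_NORM = 0.3

# Examples contributing to each optimizer step on a single device.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Steps spent ramping the learning rate up from zero.
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero at MAX_STEPS."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (MAX_STEPS - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], MAX_GRAD_NORM)
print(effective_batch, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(MAX_STEPS))
print(clipped)
```

So each optimizer step sees 16 examples per device, the learning rate peaks after 25 steps, and any gradient with a norm above 0.3 is scaled back down to 0.3 while keeping its direction.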

In [None]:
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    max_steps=500,
    learning_rate=2e-4,
    save_total_limit=3,
    logging_steps=10,
    output_dir="sql_gpt2",
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    eval_strategy="steps",  # evaluate every `eval_steps` training steps
    eval_steps=50,
    warmup_ratio=0.05,
    dataset_text_field="text",  # SFTConfig-specific option (not in TrainingArguments): the column holding our prompts
)




### ❓ Question

Is this process using unsupervised, or supervised learning?

\#\#\# YOUR RESPONSE HERE

Now, for our `SFTTrainer` AKA "Where the magic happens".

You can read all about the `SFTTrainer` here:

- [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer#trl.SFTTrainer)

This `SFTTrainer` is going to take our above training arguments, our data, our model and our tokenizer, and train it all for us!

Notice that the sequence-length limit (`max_seq_length` in older `trl` releases, `max_length` in newer ones) is not set explicitly in our config above, so the library default applies. Whatever value is used, it must not exceed GPT-2's 1,024-token context window; longer examples are truncated to fit!

In [None]:
SFTTrainer?

In [None]:
trainer = SFTTrainer(
    model=gpt2_base_model,
    args=training_args,
    train_dataset=split_sql_dataset["train"],
    eval_dataset=split_sql_dataset["val"],
    #tokenizer=gpt2_tokenizer
)


### ❓ Question

What do we use to determine loss in fine-tuning GPT-2?

> HINT: [This](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py) can help you if you get stuck!

Finally, we can call our `.train()` method and watch it go!

In [None]:
trainer.train()

Let's save our fine-tuned model!

In [None]:
trainer.save_model()

## Testing our Model

Now that we have a fine-tuned model, let's see how it did.

In [None]:
ft_gpt2_model = AutoModelForCausalLM.from_pretrained("sql_gpt2")

In [None]:
generator = pipeline('text-generation', model=ft_gpt2_model, tokenizer=gpt2_tokenizer)

In [None]:
split_sql_dataset["test"][0]

In [None]:
generate_sample(split_sql_dataset["test"][4])

That is *significantly* better.

### ❓ Question

How might you evaluate your generated SQL? Please provide 3 different methods.

> HINT: The metrics listed [here](https://huggingface.co/evaluate-metric/spaces?p=1) can help you if you get stuck!