# Generative Supervised Fine-tuning of GPT-2

Now that we have our GPT-2 model all trained up, we need a way to get it to generate what we want.

In this notebook, we're going to use an approach called "Supervised Fine-tuning" (SFT) to do exactly that.

In essence, we're going to use each example as a self-contained unit (with potential for something called "packing") and this is going to allow us to build "labeled" data.

For this notebook, we're going to be flying quite high up in the levels of abstraction. Take extra care to look into the libraries we're using today!

Let's start by grabbing our dependencies, as always:



In [2]:
!pip install transformers accelerate datasets trl bitsandbytes -qU

## Dataset Curation

We're going to be fine-tuning our model on SQL generation today.

First thing we'll need is a dataset to train on!

We'll use [this](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset today!

First up, let's load it and take a look at what we've got.

- [`load_dataset`](https://huggingface.co/docs/datasets/loading)

In [3]:
from datasets import load_dataset

sql_dataset = load_dataset(path="b-mc2/sql-create-context")


In [4]:
sql_dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'answer', 'question'],
        num_rows: 78577
    })
})

In [5]:
### Display a Sample Row
sql_dataset["train"][0]

So, we've got ~78.5K rows of:

- question - a natural language query about the table
- context - the `CREATE TABLE` statement - which gives us important context about the table
- answer - a SQL query that is aligned with both the question and the context.

Let's split our data into `train`, `val`, and `test` datasets.

We can use our `train` and `val` sets to train and evaluate our model during training - and our `test` set to ultimately benchmark the generations of our model!

In [6]:
# Split the 'train' set into train and test set (e.g., 80-20 split)
train_test_split = sql_dataset['train'].train_test_split(test_size=0.2)

# You now have a train and test set
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']


In [7]:
# Further split the 'test' set into test and validation set (e.g., 50-50 split)
test_val_split = test_dataset.train_test_split(test_size=0.5)

# You now have a test and validation set
test_dataset = test_val_split['train']
val_dataset = test_val_split['test']


In [8]:
from datasets import DatasetDict

split_sql_dataset = DatasetDict({
    'train': train_dataset,
    'test': test_dataset,
    'val': val_dataset
})


In [9]:
split_sql_dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'answer', 'question'],
        num_rows: 62861
    })
    test: Dataset({
        features: ['context', 'answer', 'question'],
        num_rows: 7858
    })
    val: Dataset({
        features: ['context', 'answer', 'question'],
        num_rows: 7858
    })
})

```
DatasetDict({
    train: Dataset({
        features: ['question', 'context', 'answer'],
        num_rows: 62861
    })
    val: Dataset({
        features: ['question', 'context', 'answer'],
        num_rows: 7858
    })
    test: Dataset({
        features: ['question', 'context', 'answer'],
        num_rows: 7858
    })
})
```
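The row counts above follow directly from the two split fractions. As a quick sanity check, this sketch reproduces them with plain arithmetic. It assumes the held-out split is rounded up to a whole row, which is consistent with the counts printed above; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test split up to a whole row."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

total_rows = 78577  # rows in the original 'train' split

# First split: 80% train, 20% held out
n_train, n_held_out = split_sizes(total_rows, 0.2)

# Second split: the held-out rows are divided 50-50 into test and val
n_test, n_val = split_sizes(n_held_out, 0.5)

print(n_train, n_test, n_val)
```

Note that `train_test_split` shuffles by default, so passing a fixed `seed` is worth considering if the same test rows need to be recovered in a later session.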

### Creating a "Prompt"

Now we need to create a prompt that's going to allow us to interact with our model when we want the trained behaviour.

Think of this as a pattern that aligns the model with our desired outputs.

We need a single text prompt, as that is what the `SFTTrainer` we're going to use to fine-tune our model expects.

The basic idea is that we're going to merge the `question`, `context`, and `answer` into a single block of text that shows the model our desired outputs.

Let's look at what that block needs to look like:

```
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
{response}{eos_token}
```

Let's look at that from a completed prompt perspective to get a bit more information:

```
<|startoftext|>### Instruction:
You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.

### Input:
How many locations did the team play at on week 7?

### Context:
CREATE TABLE table_24123547_2 (location VARCHAR, week VARCHAR)

### Response:
SELECT COUNT(location) FROM table_24123547_2 WHERE week = 7<|endoftext|>
```

As you can see, our prompt contains completed examples of our task. We're going to show our model many of these examples over and over again to teach it to produce outputs that are aligned with our goals!

First step, let's create a template we can use to call `.format()` on while constructing our prompts.

In [10]:
TEXT2SQL_TRAINING_PROMPT_TEMPLATE =  """
{bos_token}### Instruction:
You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.

### Input:
{input}

### Context:
{context}

### Response:
{response}{eos_token}
"""

In [11]:
TEXT2SQL_INFERENCE_PROMPT_TEMPLATE = """\
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
"""

Now let's create a function we can map over our dataset to create the full prompt text block.

In [12]:
def create_sql_prompt(sample):
    # Define the system message (instruction) for the model
    SYSTEM_MESSAGE = "You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables. You must output the SQL query that answers the question."

    # Fill in the template with the actual data from the sample
    full_prompt = TEXT2SQL_TRAINING_PROMPT_TEMPLATE.format(
        bos_token="<|startoftext|>",  # Replace with your model's actual beginning-of-sequence token
        eos_token="<|endoftext|>",    # Replace with your model's actual end-of-sequence token
        system_message=SYSTEM_MESSAGE,
        input=sample['question'],     # Replace with the 'question' field from the sample
        context=sample['context'],    # Replace with the 'context' field from the sample
        response=sample['answer']     # Replace with the 'answer' field from the sample
    )

    # Return the full prompt in the expected format
    return {"text": full_prompt}


#### Helper Function Begin.

I've created this helper-function to be able to see how our model is doing visibly, rather than only through metrics.

In [13]:
def create_sql_prompt_and_response(sample):
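  # Same wording and line break as the training template, without the stray
  # indentation a multi-line string literal would add to the second line
  SYSTEM_MESSAGE = (
      "You are a powerful text-to-SQL model. Your job is to answer questions about a database. "
      "You are given a question and context regarding one or more tables.\n"
      "You must output the SQL query that answers the question."
  )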

  full_prompt = TEXT2SQL_INFERENCE_PROMPT_TEMPLATE.format(
      bos_token = "<|startoftext|>",
      system_message = SYSTEM_MESSAGE,
      input = sample["question"],
      context = sample["context"]
  )

  ground_truth = sample["answer"]

  return {"full_prompt" : full_prompt, "ground_truth" : ground_truth}

#### Helper Function End.

Let's look at an example of a formatted prompt.

In [14]:
create_sql_prompt(split_sql_dataset["train"][0])

{'text': '\n<|startoftext|>### Instruction:\nYou are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.\nYou must output the SQL query that answers the question.\n\n### Input:\nWhat stadium does FK Jedinstvo play in?\n\n### Context:\nCREATE TABLE table_28668784_1 (stadium VARCHAR, home_team VARCHAR)\n\n### Response:\nSELECT stadium FROM table_28668784_1 WHERE home_team = "FK Jedinstvo"<|endoftext|>\n'}

Great!

Now we can map this over our dataset!

- [`DatasetDict.map()`](https://huggingface.co/docs/datasets/process#map)

In [15]:
split_sql_dataset = split_sql_dataset.map(lambda examples: create_sql_prompt(examples))
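`map` calls the function on every row and merges the returned dictionary into that row, which is why each split gains a `text` column alongside the original three. Here is a rough pure-Python emulation of that behaviour. The toy row reuses the example from earlier, and `add_text_column` and `emulate_map` are hypothetical stand-ins, not the `datasets` implementation.

```python
def add_text_column(row: dict) -> dict:
    # Stand-in for create_sql_prompt: returns only the new column
    return {"text": f"### Input:\n{row['question']}\n\n### Response:\n{row['answer']}"}

def emulate_map(rows: list[dict], fn) -> list[dict]:
    # Keep the existing columns and add (or overwrite) whatever fn returns
    return [{**row, **fn(row)} for row in rows]

toy_rows = [
    {
        "question": "How many locations did the team play at on week 7?",
        "context": "CREATE TABLE table_24123547_2 (location VARCHAR, week VARCHAR)",
        "answer": "SELECT COUNT(location) FROM table_24123547_2 WHERE week = 7",
    }
]

mapped = emulate_map(toy_rows, add_text_column)
print(sorted(mapped[0].keys()))
```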


## Load the Model And Preprocessing

Now for the moment we've all been waiting for...

Loading our model!

Let's use the `AutoModelForCausalLM` and `AutoTokenizer` classes from `transformers` to see just how easy this is.

- [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/v4.35.0/en/model_doc/auto#transformers.AutoModelForCausalLM)
- [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [GPT-2 Model Card](https://huggingface.co/gpt2)

In [16]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'gpt2'

# Load the pre-trained model
gpt2_base_model = AutoModelForCausalLM.from_pretrained(model_id)

# Load the tokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained(model_id)


We need to make sure our tokenizer has a `pad_token` in order to be able to pad sequences so they're all the same length.

We'll use a little trick here to set our padding token to our eos (end of sequence) token to make training go a little smoother.

In [17]:
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
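To see why a pad token is needed at all, here is a small sketch of right-padding a batch of token-id sequences to a common length, together with the attention mask that marks which positions are real. The token ids are made up for illustration, and `pad_batch` is a hypothetical helper rather than anything from `transformers`. `50256` is GPT-2's `<|endoftext|>` id, the same one that appears in the generation log in this notebook.

```python
def pad_batch(sequences: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask

EOS_ID = 50256  # GPT-2's <|endoftext|> id, reused here as the pad id

batch = [[11, 22, 33], [44]]  # made-up token ids
ids, mask = pad_batch(batch, pad_id=EOS_ID)
print(ids)
print(mask)
```

Because the pad and eos tokens share an id, the attention mask is what tells padding apart from a genuine end-of-text token.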

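We also need to make sure the model's token embeddings are aligned with the tokenizer's vocabulary size. If they didn't match, we'd face a shape error while training! Because we reused the existing eos token as our pad token, the vocabulary size is unchanged here (50257), so this call is a safeguard; it matters most when genuinely new tokens are added.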

In [18]:
gpt2_base_model.resize_token_embeddings(len(gpt2_tokenizer))

Embedding(50257, 768)

Now let's use the Hugging Face `pipeline` to see what generation looks like for our model before fine-tuning.

In [19]:
from transformers import pipeline, set_seed, GenerationConfig
generator = pipeline('text-generation', model=gpt2_base_model, tokenizer=gpt2_tokenizer)
set_seed(42)

def generate_sample(sample):
  prompt_package = create_sql_prompt_and_response(sample)

  generation_config = GenerationConfig(
      max_new_tokens=50,
      do_sample=True,
      top_k=50,
      temperature=1e-4,
      eos_token_id=gpt2_base_model.config.eos_token_id,
  )

  generation = generator(prompt_package["full_prompt"], generation_config=generation_config)
  print("---------------")
  print("Model Response:")
  print(generation[0]["generated_text"].replace(prompt_package["full_prompt"], ""))
  print("+++++++++++++++")
  print("Ground Truth")
  print(prompt_package["ground_truth"])
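The `temperature=1e-4` setting above makes sampling behave almost like greedy decoding: dividing the logits by a very small temperature pushes nearly all of the probability onto the highest-scoring token. A minimal sketch with made-up logits (the function is a hypothetical illustration, not the `transformers` implementation):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

p_warm = softmax_with_temperature(logits, 1.0)   # probability spread over all tokens
p_cold = softmax_with_temperature(logits, 1e-4)  # effectively one-hot on the top token
print(p_warm)
print(p_cold)
```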

In [20]:
generate_sample(split_sql_dataset["test"][0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------
Model Response:

You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

### Input:

Which location had previous champions of Mike Webb
+++++++++++++++
Ground Truth
SELECT location FROM table_name_93 WHERE previous_champion_s_ = "mike webb and nick fahrenheit"


## Training the Model

Now that our model and tokenizer are set up, we can finally begin training!

Let's look at our Trainer, and set some hyper-parameters:

- `per_device_train_batch_size` - the batch size on each device, which accommodates distributed training - a default we could use is `4`
- `gradient_accumulation_steps` - this is exactly the same as the previous notebook, it's a way to "simulate" a larger batch size by accumulating gradients over multiple forward/backward passes before each optimizer step - a default we could use is `4`
- `gradient_checkpointing` - I'll let the authors speak for themselves [here](https://github.com/cybertronai/gradient-checkpointing). In essence: This saves memory at the cost of computational time. - let's set this to `True`
- `max_grad_norm` - this is the threshold used for gradient clipping, which guards against exploding gradients by rescaling gradients whose norm exceeds it - let's use `0.3`
- `max_steps` - how many steps will we train for? - this is up to you
- `learning_rate` - how fast should we learn? - let's use `2e-4`
- `save_total_limit` - how many checkpoints will we keep on disk? - a value of `3` should work well
- `logging_steps` - how often we should log - up to you
- `output_dir` - where to save our checkpoints - up to you
- `optim` - which optimizer to use. A full precision paged optimizer is performant and stable, but it uses extra memory - we could use `paged_adamw_32bit` (which comes from `bitsandbytes`)
- `lr_scheduler_type` - we are once again using a cosine scheduler! - we should use `cosine`
- `evaluation_strategy` - we have an evaluation dataset, this defines when we should leverage it during training - we should use `steps`
- `eval_steps` - how many training steps to run between evaluations - up to you
- `warmup_ratio` - the fraction of `max_steps` we spend "warming up" to our full learning rate before we start decaying - a value of `0.3` should work!

- [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.15.0/main_classes/trainer#transformers.TrainingArguments)
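To make the batch-size settings concrete, here is the arithmetic for the values used in the `TrainingArguments` cell in this notebook. The single-device count is an assumption; the resulting figure of roughly 1.27 passes over the training set agrees with the `epoch` value reported in the training output.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1          # assumption: a single GPU
max_steps = 5000
train_rows = 62861       # size of the train split

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples seen, expressed as passes over the training set
epochs = max_steps * effective_batch_size / train_rows

print(effective_batch_size, round(epochs, 2))
```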

In [21]:
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [22]:
import os
# Define the base path for Google Drive
drive_base_path = '/content/drive/My Drive/'

# Define the output directory within the Google Drive
output_dir = os.path.join(drive_base_path, 'LLME-1')

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

In [44]:
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_grad_norm=0.3,
    max_steps=5000,  # Replace with the number of training steps you want
    learning_rate=2e-4,
    save_total_limit=3,
    logging_steps=100,  # Replace with how often you want to log
    output_dir=output_dir,
    optim='adamw_torch',  # 'paged_adamw_32bit' relies on bitsandbytes; this run uses the plain PyTorch AdamW instead
    lr_scheduler_type='cosine',
    evaluation_strategy='steps',
    eval_steps=250,  # Replace with how often you want to evaluate
    warmup_ratio=0.3,
    gradient_checkpointing=True,  # Enable gradient checkpointing to save memory
    # Add any other parameters you want to configure
)
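With `warmup_ratio=0.3` and `max_steps=5000`, the learning rate climbs for the first 1500 steps and then follows a cosine curve down towards zero. The sketch below mirrors the usual shape of a warmup-plus-cosine schedule; it is a hypothetical re-implementation for illustration and is not copied from the `transformers` source.

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Start, mid-warmup, peak, mid-decay, and end of training
for step in (0, 750, 1500, 3250, 5000):
    print(step, lr_at_step(step, max_steps=5000, peak_lr=2e-4, warmup_ratio=0.3))
```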

Now, for our `SFTTrainer` AKA "Where the magic happens".

This `SFTTrainer` is going to take our above training arguments, our data, our model and our tokenizer, and train it all for us!

Notice that we're setting `max_seq_length` to the maximum context window of our model - this ensures we do not exceed our maximum context window, and will pad our examples up to the maximum context window!
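The introduction mentioned "packing" as an alternative to handling each example on its own. The idea is to concatenate tokenized examples, separated by an end-of-sequence token, and cut the stream into fixed-size blocks so little space is wasted on padding. The sketch below shows one common way to do that with made-up token ids. Whether and how `SFTTrainer` packs depends on its `packing` option and the TRL version, so treat this as an illustration rather than its implementation.

```python
def pack_examples(tokenized: list[list[int]], eos_id: int, block_size: int) -> list[list[int]]:
    """Concatenate examples (each followed by EOS) and cut into fixed-size blocks."""
    stream: list[int] = []
    for ids in tokenized:
        stream.extend(ids + [eos_id])
    # Drop the trailing remainder that does not fill a whole block
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up token ids
blocks = pack_examples(examples, eos_id=0, block_size=4)
print(blocks)
```

Notice that a block can start or end in the middle of an example; that is the trade-off packing makes for denser batches.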

#### ❓QUESTION❓

What is the maximum input sequence length for GPT-2?

In [47]:
trainer = SFTTrainer(
 gpt2_base_model,
 dataset_text_field="text",
 train_dataset=split_sql_dataset["train"],
 eval_dataset=split_sql_dataset["val"],
 tokenizer=gpt2_tokenizer,
 max_seq_length=1024,
 args=training_args
)


Finally, we can call our `.train()` method and watch it go!

In [48]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


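| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 250  | 0.7878 | 0.639709 |
| 500  | 0.6227 | 0.574429 |
| 750  | 0.5755 | 0.540961 |
| 1000 | 0.5627 | 0.522454 |
| 1250 | 0.5462 | 0.511856 |
| 1500 | 0.5316 | 0.504759 |
| 1750 | 0.5126 | 0.499141 |
| 2000 | 0.4988 | 0.48404 |
| 2250 | 0.495 | 0.47826 |
| 2500 | 0.4827 | 0.466349 |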




TrainOutput(global_step=5000, training_loss=0.5347139999389648, metrics={'train_runtime': 7320.2855, 'train_samples_per_second': 10.929, 'train_steps_per_second': 0.683, 'total_flos': 6580788844032000.0, 'train_loss': 0.5347139999389648, 'epoch': 1.27})

Let's save our fine-tuned model!

In [52]:
trainer.save_model()

## Testing our Model

Now that we have a fine-tuned model, let's see how it did.

In [23]:
ft_gpt2_model = AutoModelForCausalLM.from_pretrained('/content/drive/My Drive/LLME-1')

In [24]:
generator = pipeline('text-generation', model=ft_gpt2_model, tokenizer=gpt2_tokenizer)

In [25]:
generate_sample(split_sql_dataset["test"][0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------
Model Response:
SELECT location FROM table_name_93 WHERE previous_champion_s_ = "mike webbwee" AND previous_champion_s_ = "nick fahrenheit" AND previous_champion_s_ = "
+++++++++++++++
Ground Truth
SELECT location FROM table_name_93 WHERE previous_champion_s_ = "mike webb and nick fahrenheit"


That is *significantly* better.

#### ❓QUESTION❓

What methods could we use to validate our SQL outputs?

## Validating SQL Outputs
* Syntax Validation: Ensure that the generated SQL queries are syntactically correct. This can be done using SQL parsers or by attempting to execute the queries against a test database (without committing any changes).

* Semantic Validation: Check if the SQL queries are semantically correct and actually answer the questions posed. This involves executing the queries against a database and comparing the results with expected outputs.

* Comparison with Ground Truth: If you have a dataset where the correct SQL queries are known (like in a benchmark dataset), compare the generated queries with these ground truth queries.

* Manual Review: In some cases, especially for complex queries, manual review by a SQL expert might be necessary to ensure the quality and accuracy of the queries.

* Automated Testing Frameworks: Develop or use existing automated testing frameworks that can run the generated SQL queries against a database and validate the results based on predefined criteria.

* Performance Metrics: Use metrics like accuracy, precision, recall, and F1 score to evaluate the performance of your SQL generation model, especially if you have a test set with known correct answers.
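The first two ideas in the list above (syntax validation and semantic validation by execution) can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch under some assumptions: the `CREATE TABLE` context is valid SQLite, SQLite's dialect is close enough to the target database, and the inserted rows are made up for illustration. `is_valid_sql` and `same_result` are hypothetical helper names.

```python
import sqlite3

def is_valid_sql(context: str, query: str) -> bool:
    """Return True if `query` compiles against the schema defined in `context`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context)        # create the (empty) tables
        conn.execute(f"EXPLAIN {query}")   # compile the query; no data needed
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

def same_result(context: str, inserts: str, generated: str, ground_truth: str) -> bool:
    """Return True if both queries return the same rows on a small test database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context)
        conn.executescript(inserts)
        return conn.execute(generated).fetchall() == conn.execute(ground_truth).fetchall()
    except sqlite3.Error:
        return False
    finally:
        conn.close()

context = "CREATE TABLE table_24123547_2 (location VARCHAR, week VARCHAR)"
inserts = "INSERT INTO table_24123547_2 VALUES ('Lambeau Field', '7'), ('Soldier Field', '8');"  # made-up rows
truth = "SELECT COUNT(location) FROM table_24123547_2 WHERE week = 7"

print(is_valid_sql(context, truth))                                        # query compiles against the schema
print(is_valid_sql(context, "SELECT COUNT(city) FROM table_24123547_2"))   # references a column that does not exist
print(same_result(context, inserts, "SELECT COUNT(*) FROM table_24123547_2 WHERE week = 7", truth))
```

Comparing execution results rather than query strings means two differently worded but equivalent queries still count as a match, although it can also give false positives when the test rows are too few to tell two queries apart.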

#### ❓QUESTION❓

How would you extend this notebook to another use-case?

- You would follow the same steps as above and fine-tune the model on another dataset related to your new domain and use case. Here are some steps suggested by ChatGPT:

To extend the notebook to another use-case, consider the following steps:

1. Identify the Use-Case: Clearly define the new use-case. It could be anything from text classification, sentiment analysis, language translation, to another form of text generation like summarization or question answering.

2. Data Collection and Preparation: Gather and preprocess a dataset suitable for your new use-case. This might involve data cleaning, tokenization, and formatting the data in a way that's suitable for training your model.

3. Adapt the Model: Depending on the new use-case, you might need to modify the model architecture or choose a different pre-trained model more suited to your task.

4. Fine-Tuning: Use the new dataset to fine-tune the model on the new task. This will involve adjusting hyperparameters, training schedules, and possibly the loss function.

5. Validation and Testing: Develop a validation and testing strategy for your new use-case. This might involve creating or using existing benchmarks and metrics specific to the task.

6. Iterative Improvement: Based on the performance on the validation and test sets, iteratively improve the model by tweaking the model architecture, training data, or training procedure.

7. Deployment and Integration: Finally, consider how the model will be deployed and integrated into a larger system or application. This might involve additional steps like optimizing the model for production, creating APIs for model access, or integrating with other software components.

By following these steps, you can adapt the existing notebook to a wide range of NLP tasks and applications.
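In practice, most of this notebook carries over unchanged and the main dataset-specific piece is the prompt-formatting function. As a sketch, here is what that function might look like for a summarization dataset. The template wording and the `document` and `summary` column names are hypothetical, chosen only for illustration.

```python
SUMMARIZATION_TEMPLATE = """\
### Instruction:
Summarize the document below in one sentence.

### Input:
{document}

### Response:
{summary}{eos_token}"""

def create_summary_prompt(sample: dict) -> dict:
    # 'document' and 'summary' are hypothetical column names
    text = SUMMARIZATION_TEMPLATE.format(
        document=sample["document"],
        summary=sample["summary"],
        eos_token="<|endoftext|>",
    )
    # Return a single 'text' field, like create_sql_prompt does
    return {"text": text}

example = {
    "document": "GPT-2 is a causal language model that predicts the next token in a sequence.",
    "summary": "GPT-2 predicts the next token.",
}
prompt_text = create_summary_prompt(example)["text"]
print(prompt_text)
```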