# Recipe Generation with GPT-2

Now we are going to explore how to fine-tune a **transformer** for a custom text completion task -- in particular, we will teach GPT-2 how to generate recipes given a list of ingredients.

Before the world met ChatGPT, GPT-2 was already showing us what generative AI could do -- if we were paying attention. To learn more about this large language model, which was **trained on 8 million web pages**, check out the official model card on [Hugging Face](https://huggingface.co/gpt2).

Before we get cooking, here's a roadmap of what is to come:
- GPU time
- Download the dataset
- Create the DataFrame
- Make the training and test set
- Load the pretrained GPT-2 model

## GPU time

In [None]:
# Check which GPU is available for model training
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-51852616-34bb-61bb-90e3-5b0637997e0e)
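The `!nvidia-smi` shell magic only works inside a notebook. As a minimal sketch, assuming you want the same check from plain Python, the helper below (the name `list_gpus` is hypothetical, not part of the notebook) calls the same command and returns an empty list when no NVIDIA tooling is found.

```python
import shutil
import subprocess


def list_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] if it is unavailable."""
    # nvidia-smi is only on the PATH when NVIDIA drivers are installed
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```

On a Colab GPU runtime this should list the same device shown above; on a CPU-only machine it prints the fallback message.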


In [None]:
# Install, if needed, and import the Hugging Face Transformers library
!pip install transformers
import transformers



## Download the dataset

In [None]:
import pandas as pd

In [None]:
# Install gdown, a Python tool that helps you download files from Google Drive directly into your Colab
!pip install --upgrade gdown



This dataset includes 120K recipes! All the recipes are formatted in the same way, and, as we will see, formatting is critical for solving text completion tasks like this. Part of the magic of transformers is that they are able to recognize and learn patterns.

In [None]:
import gdown
# Download the recipe dataset (120K examples) from Google Drive
gdrivelink='https://drive.google.com/uc?id=10KF1LqW9k2MgTb1GwSPlfvX1INFPDxi5'
gdown.download(gdrivelink, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=10KF1LqW9k2MgTb1GwSPlfvX1INFPDxi5
From (redirected): https://drive.google.com/uc?id=10KF1LqW9k2MgTb1GwSPlfvX1INFPDxi5&confirm=t&uuid=d79b4341-ef5f-4508-b44b-390e12d1dc03
To: /content/recipes.csv
100%|██████████| 343M/343M [00:02<00:00, 156MB/s]


'recipes.csv'

Recipe dataset at https://eightportions.com/datasets/Recipes/

## Create the DataFrame

In [None]:
# Load the CSV file we just downloaded into a DataFrame
df = pd.read_csv('recipes.csv')

In [None]:
# Check the first five rows
df.head()

Unnamed: 0.1,Unnamed: 0,title,ingredients,instructions,source,ingredients_length,instructions_length,combined
0,0,Slow Cooker Chicken and Dumplings,"['4 skinless, boneless chicken breast halves '...","place the chicken, butter, soup, and onion in ...",ar,5,54,"\n Ingredients: \n 4 skinless, boneless chick..."
1,1,Awesome Slow Cooker Pot Roast,['2 (10.75 ounce) cans condensed cream of mush...,"in a slow cooker, mix cream of mushroom soup, ...",ar,4,45,\n Ingredients: \n 2 (10.75 ounce) cans conde...
2,2,Brown Sugar Meatloaf,"['1/2 cup packed brown sugar ', '1/2 cup ketch...",preheat oven to 350 degrees f (175 degrees c)....,ar,10,68,\n Ingredients: \n 1/2 cup packed brown sugar...
3,3,Best Chocolate Chip Cookies,"['1 cup butter, softened ', '1 cup white sugar...",preheat oven to 350 degrees f (175 degrees c)....,ar,11,75,"\n Ingredients: \n 1 cup butter, softened \n..."
4,4,Homemade Mac and Cheese Casserole,"['8 ounces whole wheat rotini pasta ', '3 cups...",preheat oven to 350 degrees f. line a 2-quart ...,ar,13,176,\n Ingredients: \n 8 ounces whole wheat rotin...


In [None]:
# Check the length
len(df)

121456

In [None]:
# Now check the shape
df.shape

(121456, 8)

In [None]:
# Check for null values
df.isna().sum()

Unnamed: 0,0
Unnamed: 0,0
title,0
ingredients,0
instructions,0
source,0
ingredients_length,0
instructions_length,0
combined,0


Time to explore the `combined` column -- the only column that matters for the text generation task we are going to give GPT-2.

In [None]:
# 'combined' is the only column we care about in this exercise
# It holds the ingredients and instructions together, and it's essential that we format the training set very carefully
df.combined

Unnamed: 0,combined
0,"\n Ingredients: \n 4 skinless, boneless chick..."
1,\n Ingredients: \n 2 (10.75 ounce) cans conde...
2,\n Ingredients: \n 1/2 cup packed brown sugar...
3,"\n Ingredients: \n 1 cup butter, softened \n..."
4,\n Ingredients: \n 8 ounces whole wheat rotin...
...,...
121451,\n Ingredients: \n 4 ears fresh corn \n 2 hea...
121452,\n Ingredients: \n 4 large plum tomatoes \n s...
121453,\n Ingredients: \n 3 tablespoons olive oil \n...
121454,\n Ingredients: \n 8 ounces butter \n 8 ounce...


In [None]:
# Let's improve the formatting by printing just the first row, selected with .iloc[0]
# Important note: every element in this column follows this same format
print(df.combined.iloc[0])

 
 Ingredients: 
 4 skinless, boneless chicken breast halves  
 2 tablespoons butter  
 2 (10.75 ounce) cans condensed cream of chicken soup  
 1 onion, finely diced  
 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces  
 Instructions: 
 place the chicken, butter, soup, and onion in a slow cooker, and fill with enough water to cover. cover, and cook for 5 to 6 hours on high. about 30 minutes before serving, place the torn biscuit dough in the slow cooker. cook until the dough is no longer raw in the center.  <|endoftext|>


Notice: Every set of Instructions ends with the following -- <|endoftext|>. This is GPT-2's end-of-text token, and it is crucial because the model needs to know where each recipe in the training set ends.

Also, notice that ingredients in the text often appear in the same order in the Ingredients section *and* the Instructions section. Humans can notice this pattern, of course, and transformers can, too.

So, here's **the gameplan**: We will show GPT-2 120K examples where the Ingredients section is followed by the Instructions section. Then we will provide it just the Ingredients section and it will be prompted to generate text, or Instructions.
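To make the training format concrete, here is a minimal sketch of how one example in this layout could be assembled from raw parts. The helper `format_recipe` is hypothetical (the dataset already ships with the `combined` column built), and the exact whitespace is an assumption copied from the first row printed above.

```python
END_OF_TEXT = "<|endoftext|>"  # GPT-2's end-of-text marker, as seen in the dataset


def format_recipe(ingredients, instructions):
    """Build one training example: Ingredients block, Instructions block, end token."""
    lines = [" ", " Ingredients: "]
    # One ingredient per line, padded the way the dataset rows appear to be
    lines += [f" {item.strip()}  " for item in ingredients]
    lines += [" Instructions: ", f" {instructions.strip()}  {END_OF_TEXT}"]
    return "\n".join(lines)


example = format_recipe(
    ["2 tablespoons butter", "1 onion, finely diced"],
    "melt the butter and cook the onion.",
)
print(example)
```

Because every example follows the same skeleton, the model can learn that an `Instructions:` line follows the ingredient list and that the end token closes the recipe.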

## Make the training and test set

So, how do you measure the performance of generating text?

It's a little murky.

Even if there is not a clear metric we can use, we will still split the dataset into two -- training and test.
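One commonly used number for language models is perplexity on held-out text: the exponential of the average negative log-probability the model assigned to each true token. The sketch below is only an illustration with made-up probabilities; this notebook does not compute it, and the function name `perplexity` is an assumption.

```python
import math


def perplexity(token_probs):
    """exp(-mean(log p)) over the probabilities given to the true next tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


# Made-up probabilities: a model that is usually right vs. one guessing among 4 options
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.25, 0.25, 0.25])
print(f"confident: {confident:.2f}, unsure: {unsure:.2f}")
```

Lower is better, but a low perplexity still does not guarantee a recipe anyone would want to cook, which is part of why evaluation stays murky.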

In [None]:
# Create the training set, just the 'combined' column values and grab the first 120,000
dataset_train = df.combined.values[:120000]

In [None]:
# Check the length
len(dataset_train)

120000

In [None]:
# For the test set, we will take the rest
dataset_test = df.combined.values[120000:]


In [None]:
# Check the length
len(dataset_test)

1456

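So, as we see above, we have a huge training set. But we need to turn the training and test sets into **text files**. And this is why those <|endoftext|> tokens are so important: once all the recipes are joined into one file, the token is the only thing marking where one recipe ends and the next begins.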

In [None]:
# Here's how to create the text file for 'dataset_train'
with open('dataset_train.txt', 'w') as f:
  f.write('\n'.join(dataset_train))

In [None]:
# And now for the test set
with open('dataset_test.txt', 'w') as f:
  f.write('\n'.join(dataset_test))
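As a quick sanity check on this file format, the sketch below writes two toy recipes (invented for illustration, not taken from the dataset) the same way and shows that splitting on the end token recovers them. The file name `toy_train.txt` is an assumption.

```python
import os
import tempfile

END_OF_TEXT = "<|endoftext|>"

# Two made-up recipes in the same layout as the 'combined' column
toy_recipes = [
    f" \n Ingredients: \n 1 egg  \n Instructions: \n fry the egg.  {END_OF_TEXT}",
    f" \n Ingredients: \n 2 slices bread  \n Instructions: \n toast the bread.  {END_OF_TEXT}",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_recipes))
    with open(path, encoding="utf-8") as f:
        contents = f.read()

# Each recipe contains newlines of its own, so line breaks cannot mark
# recipe boundaries; the end token can.
recovered = [chunk for chunk in contents.split(END_OF_TEXT) if chunk.strip()]
print(f"{len(recovered)} recipes recovered")
```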

In [None]:
# These are massive files, so, let's just display a small piece of 'dataset_train.txt'
!head -20 dataset_train.txt

 
 Ingredients: 
 4 skinless, boneless chicken breast halves  
 2 tablespoons butter  
 2 (10.75 ounce) cans condensed cream of chicken soup  
 1 onion, finely diced  
 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces  
 Instructions: 
 place the chicken, butter, soup, and onion in a slow cooker, and fill with enough water to cover. cover, and cook for 5 to 6 hours on high. about 30 minutes before serving, place the torn biscuit dough in the slow cooker. cook until the dough is no longer raw in the center.  <|endoftext|>
 
 Ingredients: 
 2 (10.75 ounce) cans condensed cream of mushroom soup  
 1 (1 ounce) package dry onion soup mix  
 1 1/4 cups water  
 5 1/2 pounds pot roast  
 Instructions: 
 in a slow cooker, mix cream of mushroom soup, dry onion soup mix and water. place pot roast in slow cooker and coat with soup mixture. cook on high setting for 3 to 4 hours, or on low setting for 8 to 9 hours.  <|endoftext|>
 
 Ingredients: 
 1/2 cup packed brown sugar  


## Load the pretrained GPT-2 model

We're now ready to load the core ingredients: the **pretrained GPT-2 model** and its **tokenizer**. We'll use Hugging Face's `transformers` library to grab both.

Specifically, we’ll be using:
1. `AutoModelForCausalLM` -- the GPT-2 language model for text generation
2. `AutoTokenizer` -- the tokenizer that breaks text into tokens the model understands

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# Here you can either use your own model or download one from Hugging Face
# We, of course, will be using GPT-2, but note: this is the distilled version (DistilGPT2),
# which is smaller and faster while staying close to the famous original in quality
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')


## Alternatively, load a fine-tuned model we previously saved

In [None]:
import gdown
gdrivelink='https://drive.google.com/drive/folders/1qHEQ6zpOeGDiBlZ3q4zkpMVSVa86d69q?usp=sharing'
gdown.download_folder(gdrivelink, quiet=True)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('recipe_generation_model')
model = AutoModelForCausalLM.from_pretrained('recipe_generation_model')

## Test the model on one sentence


Let's find out how the pretrained model does on the task of completing a sentence.

We’ll first encode a sentence using the tokenizer and see what the model sees.

**Side Note**: Passing `return_tensors='pt'` asks the tokenizer to return **PyTorch** tensors. Even though we didn't explicitly import PyTorch, Hugging Face's `transformers` library uses it under the hood here.

So, the output is a PyTorch tensor -- a powerful data structure that we'll work with more as we go deeper into model building.

In [None]:
# First, define the input text
# And then encode it with 'tokenizer.encode()'
input_text = 'I hit the slopes early to snowboard, caught the first lift, and by noon I had already'
enc_input = tokenizer.encode(input_text, return_tensors='pt', add_special_tokens=False)
enc_input

tensor([[   40,  2277,   262, 35082,  1903,   284,  6729,  3526,    11,  4978,
           262,   717, 10303,    11,   290,   416, 19613,   314,   550,  1541]])

Now that we've tokenized our input sentence, we can feed it into the model and generate predictions to complete the sentence.

Note: This block of code is a **beast** so make sure to read through those comment lines.

In [None]:
output_sequences = model.generate(
    input_ids = enc_input, # The encoded input from earlier
    max_length= 70,  # The max total length in tokens (prompt + generated text)
    temperature = 0.9, # Controls randomness, higher = more creative, closer to 0 = more predictable
    top_k = 20, # Considers only the top 20 most likely next tokens
    top_p = 0.9, # Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches 0.9
    repetition_penalty = 1, # Penalty for repeating tokens; 1 means no penalty
    do_sample = True, # If true, this allows for randomness instead of always picking the highest-probability word
    num_return_sequences = 5 # Number of output sentences
)
for i in range(len(output_sequences)):
  print(f'{i}: {tokenizer.decode(output_sequences[i])}\n')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: I hit the slopes early to snowboard, caught the first lift, and by noon I had already taken the first step.

As I was climbing, I saw my first line of snowboarders and I began to climb up the slopes. I knew it was all about the way back, and when I started to make the first step,

1: I hit the slopes early to snowboard, caught the first lift, and by noon I had already set out to take my way to the snowboard. I couldn't find the lift that I needed to make it, but I found it. I climbed on the slopes and went to a snowboard shop and got some ice. I had been there

2: I hit the slopes early to snowboard, caught the first lift, and by noon I had already done my part and ran on it. The next day I started to see the snowpack and to see how hard it was to do it. It was a pretty challenging day. I had a couple of really strong legs that were very hard to do

3: I hit the slopes early to snowboard, caught the first lift, and by noon I had already reached the top of the hill, and then I could 
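To see what `temperature`, `top_k`, and `top_p` do to the next-token choice, here is a simplified pure-Python sketch on a toy distribution. The logits are made up, the function name `filter_next_token_probs` is hypothetical, and the library's real implementation differs in its details; this only illustrates the idea.

```python
import math
import random


def filter_next_token_probs(logits, temperature=0.9, top_k=20, top_p=0.9):
    """Apply temperature, then top-k, then top-p filtering to a {token: logit} dict."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}

    # Top-k: keep only the k most likely tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Made-up logits for the token after "...by noon I had already"
toy_logits = {"reached": 2.0, "taken": 1.5, "done": 1.0, "eaten": -1.0, "purple": -3.0}
filtered = filter_next_token_probs(toy_logits)
print(filtered)

# Sampling (do_sample=True) then draws from the filtered distribution
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

Unlikely tokens such as "purple" are cut before sampling, which is why the completions stay fluent even though they wander.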

Obviously, the generated completions are nonsense. But what about a recipe? Can our model accomplish that?

In [None]:
# Display the first element from 'dataset_train'
dataset_train[0]

' \n Ingredients: \n 4 skinless, boneless chicken breast halves  \n 2 tablespoons butter  \n 2 (10.75 ounce) cans condensed cream of chicken soup  \n 1 onion, finely diced  \n 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces  \n Instructions: \n place the chicken, butter, soup, and onion in a slow cooker, and fill with enough water to cover. cover, and cook for 5 to 6 hours on high. about 30 minutes before serving, place the torn biscuit dough in the slow cooker. cook until the dough is no longer raw in the center.  <|endoftext|>'

In [None]:
# Keep everything before 'Instructions:' and add the 'Instructions:' label back so it can serve as a prompt
dataset_train[0].split('Instructions:')[0]+'Instructions:'

' \n Ingredients: \n 4 skinless, boneless chicken breast halves  \n 2 tablespoons butter  \n 2 (10.75 ounce) cans condensed cream of chicken soup  \n 1 onion, finely diced  \n 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces  \n Instructions:'
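The same split can be wrapped in a small reusable helper. This is a sketch under the assumption that every recipe contains the `Instructions:` label; the name `make_prompt` is hypothetical, and the toy recipe is invented for illustration.

```python
def make_prompt(recipe_text, marker="Instructions:"):
    """Return the recipe up to and including the marker, dropping the instructions."""
    head, found, _ = recipe_text.partition(marker)
    if not found:
        # Fail loudly rather than silently prompting with a full recipe
        raise ValueError(f"{marker!r} not found in recipe text")
    return head + marker


toy_recipe = " \n Ingredients: \n 1 egg  \n Instructions: \n fry the egg.  <|endoftext|>"
prompt = make_prompt(toy_recipe)
print(repr(prompt))
```

Using `partition` instead of `split(...)[0]` makes the missing-label case explicit, which matters when looping over many rows.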

In [None]:
# Now, let's try it as our new input text
# Note: Need to increase the length of the generated sentence
new_input_text = dataset_train[0].split('Instructions:')[0]+'Instructions:'
enc_input = tokenizer.encode(new_input_text, return_tensors='pt', add_special_tokens=False)
enc_input
output_sequences = model.generate(
    input_ids = enc_input, # The encoded input from earlier
    max_length= 150,  # The max total length in tokens (prompt + generated text)
    temperature = 0.9, # Controls randomness, higher = more creative, closer to 0 = more predictable
    top_k = 20, # Considers only the top 20 most likely next tokens
    top_p = 0.9, # Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches 0.9
    repetition_penalty = 1, # Penalty for repeating tokens; 1 means no penalty
    do_sample = True, # If true, this allows for randomness instead of always picking the highest-probability word
    num_return_sequences = 5 # Number of output sentences
)
for i in range(len(output_sequences)):
  print(f'{i}: {tokenizer.decode(output_sequences[i])}\n')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0:  
 Ingredients: 
 4 skinless, boneless chicken breast halves  
 2 tablespoons butter  
 2 (10.75 ounce) cans condensed cream of chicken soup  
 1 onion, finely diced  
 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces  
 Instructions: 1. In a large saucepan, combine the 2 tablespoons butter and 1/2 tablespoons butter 

2 tablespoons butter 
1/2 cup coconut oil
2 tablespoons butter 
2 tablespoons butter 
2 tablespoons butter 
1/2 teaspoon vanilla
3 tablespoons butter 
3 tablespoons butter 
3 tablespoons butter 
1/4 cup coconut oil
1/2 cup cream

1:  
 Ingredients: 
 4 skinless, boneless chicken breast halves  
 2 tablespoons butter  
 2 (10.75 ounce) cans condensed cream of chicken soup  
 1 onion, finely diced  
 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces  
 Instructions:



















































































2:  
 Ingredients: 
 4 skinless, boneless chicken breast halves  
 2 tablespoons butter  

The results aren't a total mess -- but they need a lot of help. In fact, some are blank.
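Since the decoded output echoes the prompt, a blank generation is easy to miss by eye. As a minimal sketch, the hypothetical helpers below strip the prompt and anything after the end token, then flag completions that contain only whitespace. The sample strings are invented for illustration.

```python
def completion_after_prompt(generated_text, prompt, end_token="<|endoftext|>"):
    """Drop the echoed prompt and anything after the end token."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    return completion.split(end_token)[0].strip()


def is_blank(generated_text, prompt):
    """True when the model produced nothing but whitespace after the prompt."""
    return completion_after_prompt(generated_text, prompt) == ""


toy_prompt = " \n Ingredients: \n 1 egg  \n Instructions:"
outputs = [
    toy_prompt + " fry the egg in butter.<|endoftext|>",
    toy_prompt + "\n\n\n\n\n",
]
for i, text in enumerate(outputs):
    print(i, "blank" if is_blank(text, toy_prompt) else completion_after_prompt(text, toy_prompt))
```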

Worth noting: GPT-2 is speaking grammatically correct English, but this is not at all accomplishing the task we set out to do. So, up next? Fine-tuning the model.

# Fine-tune the model

Download the Python script for fine-tuning.

In [None]:
!curl https://webpages.scu.edu/ftp/msamorani/NLP/run_lm_finetuning.py > run_lm_finetuning.py

In [None]:
!mkdir experiments

## Fine-tune (original lecture)

In [None]:
# In the original lecture, I ran this, but with the latest version of the transformers package the output is overwhelming
!bash run_experiments.sh

## Fine-tune (better output)

In [None]:
# Run this cell instead of the one above to limit the output

line_interval = 100 # in order to limit the output size, print only one line every 100

import subprocess

process = subprocess.Popen(
    ["bash", "run_experiments.sh"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # merge stderr into stdout so nothing is lost
    text=True,
    bufsize=1  # line-buffered
)

for i, line in enumerate(process.stdout, 1):
    if i % line_interval == 0:
        print(f"[Line {i}] {line.strip()}")

process.wait()  # wait for the script to finish before moving on

## Save the model

# Testing