# Text Generation Using GPT (with Hugging Face)

## Project Setup

## Note:
transformers is a Python library from Hugging Face for working with transformer-architecture neural networks. By default it may not be installed in your Python environment, so the command below installs the package into your notebook session.

In [None]:
!pip install -q transformers

## Note:

Below are the imports we need for pre-processing (converting text to tokens), plus torch (PyTorch, a neural network library) that we need later.

The google.colab library provides operations for interacting with your Google Drive.

In [None]:
import torch
import shutil
from torch.utils.data import Dataset, random_split
from transformers import Trainer, TrainingArguments, GPTNeoForCausalLM, GPT2Tokenizer


from google.colab import drive


## Data Preparation

## Note:
The ! prefix is a notebook shell escape: it runs the rest of the line as a shell command.
The command below downloads the text file from GitHub into the Colab session.

In [None]:
# Load data into colab
!wget https://raw.githubusercontent.com/dsirepos/yumyum/main/recipes_13july_v2.txt

--2023-07-13 22:49:07--  https://raw.githubusercontent.com/dsirepos/yumyum/main/recipes_13july_v2.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1009918 (986K) [text/plain]
Saving to: ‘recipes_13july_v2.txt’


2023-07-13 22:49:08 (47.7 MB/s) - ‘recipes_13july_v2.txt’ saved [1009918/1009918]



## Note:
The function below is a file system operation: it connects the notebook to your Google Drive account.


This should prompt you to grant access to your Google Drive; accept the prompt.
There is no need to run this every time during testing; running it once, on the first execution, is enough.

In [None]:
# Connects colab to google drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Note:
shutil stands for shell utilities. It is a Python standard library module for file operations similar to shell commands.

In this case we are just copying the text data we downloaded from GitHub to your Google Drive. Make sure the destination path exists in your Google Drive.


In [None]:
shutil.copy("/content/recipes_13july_v2.txt","drive/MyDrive/AICamp/yumyum_v2")


'drive/MyDrive/AICamp/yumyum_v2/recipes_13july_v2.txt'
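
## Note:
If the destination folder might not exist yet, one option is to create it before copying. The sketch below shows the pattern on a temporary directory; the helper name copy_into and all folder and file names are placeholders, not the notebook's Drive paths.

```python
import shutil
import tempfile
from pathlib import Path

def copy_into(src, dest_dir):
    """Copy src into dest_dir, creating dest_dir first if needed."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return Path(shutil.copy(src, dest_dir))      # path of the copied file

# Demonstration on throwaway paths (placeholders for the Drive folder)
workdir = Path(tempfile.mkdtemp())
source = workdir / "recipes_demo.txt"
source.write_text("give me recipe for Toast>>\n Toast the bread,\n", encoding="utf-8")

copied = copy_into(source, workdir / "AICamp" / "demo")
print(copied.name)
```

Creating the folder first avoids the case where shutil.copy treats a missing folder name as the name of the target file.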

## Note:
Read the text data from the file, then split it into a list with one entry per recipe (recipes are separated by blank lines).

In [None]:
file_path = "/content/drive/MyDrive/AICamp/yumyum_v2/recipes_13july_v2.txt"

with open(file_path,'r',encoding='utf-8', errors='' ) as f:
  text_corpus = f.read()


recipes = text_corpus.replace('>>', ': ').split('\n\n')
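
## Note:
To see what the replace and split steps do, here is the same pair of operations on a tiny made-up corpus. The two sample recipes are invented for illustration and only assume the layout used above: a '>>' marker after the title and a blank line between recipes.

```python
sample_corpus = (
    "give me recipe for Toast>>\n Toast the bread,\n Spread butter,\nNER:bread butter.\n\n"
    "give me recipe for Tea>>\n Boil water,\n Steep the tea bag,\nNER:water tea."
)

# Same two steps as the cell above: swap the '>>' marker for ': ',
# then cut the corpus into one string per recipe at each blank line.
sample_recipes = sample_corpus.replace('>>', ': ').split('\n\n')

print(len(sample_recipes))
print(sample_recipes[0].splitlines()[0])
```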



## Print data:

In [None]:
recipes[:4]


['give me recipe for Cheeseburger Potato Soup:\n Wash potatoes,\n prick several times with a fork,\n Microwave them with a wet paper towel covering the potatoes on high for 6-8 minutes,\n The potatoes should be soft ready to eat,\n Let them cool enough to handle,\n Cut in half lengthwise,\n scoop out pulp and reserve,\n Discard shells,\n Brown ground beef until done,\n Drain any grease from the meat,\n Set aside when done,\n Meat will be added later,\n Melt butter in a large kettle over low heat,\n add flour stirring until smooth,\n Cook 1 minute stirring constantly  Gradually add milk,\n cook over medium heat stirring constantly until thickened and bubbly,\n Stir in potato ground beef salt pepper 1 cup of cheese 2 tablespoons of green onion and 1/2 cup of bacon,\n Cook until heated (do not boil),\n Stir in sour cream if desired,\n cook until heated (do not boil),\n Sprinkle with remaining cheese bacon and green onions ,\nNER:sour cream bacon pepper extra lean ground beef cheddar chees

## Note:
'\n' is an escape sequence. It may look odd in the printed list, but it represents a line break: it moves the cursor to a new line.

## Note:
1. The code below removes unnecessary start and end words. It is a simple list slice that starts at the third sentence and goes up to the last line.
2. It then removes lines of length zero, which means empty lines.

## Note:
Tokenization: https://huggingface.co/docs/transformers/

Below is a Python class that takes three arguments at instantiation:
1. txt_list: the sentences (here, the recipes)
2. tokenizer: the tokenizer to be used
3. max_length: the maximum length of the input, in tokens

It encodes the input sentences using the tokenizer and creates 3 instance variables:
1. input_ids: the vocabulary indices of the tokens
2. attn_masks: the attention mask, a flag per token that marks it as a real token (1) or padding (0)
3. labels: initialized but never filled in by this class; the data collator in the training cells supplies the labels as a copy of input_ids

In [None]:
# Custom dataset class to load the dataset
class RecipeDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for txt in txt_list:
            # Encode the descriptions using the GPT-Neo tokenizer
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>',
                                       truncation=True,
                                       max_length=max_length,
                                       padding="max_length")
            input_ids = torch.tensor(encodings_dict['input_ids'])
            self.input_ids.append(input_ids)
            mask = torch.tensor(encodings_dict['attention_mask'])
            self.attn_masks.append(mask)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
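
## Note:
The effect of truncation, padding to max_length and the attention mask can be shown without the library. This is a minimal sketch in plain Python: the helper pad_and_mask and the token ids are made up for illustration, and it assumes padding is added on the right.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Illustrative ids only: 50257 stands in for <|startoftext|>,
# 50256 for <|endoftext|> and 50258 for <|pad|>.
ids, mask = pad_and_mask([50257, 11, 22, 33, 50256], max_length=8, pad_id=50258)
print(ids)
print(mask)
```

Every sequence comes out with the same length, which is what lets torch.stack combine them into a batch later on.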

## Initialize tokenizer, model

## Note:
Steps implemented below:
1. manual_seed(42): fixes the seed of the random number generator. Training involves random steps (for example shuffling the data and initializing new weights), so a fixed seed helps to reproduce the same result on every run. The value 42 is arbitrary; any fixed integer works.

2. We create a tokenizer instance with a few arguments:
  a. the name of the pretrained tokenizer on Hugging Face, in the form 'user_id/model_name'
  b. bos_token: marks the start of your input text
  c. eos_token: marks the end of your input text
  d. pad_token: sequences (phrases or sentences) often have variable length, but they must have a fixed length when fed to the neural network, so the pad token fills in any sequence shorter than the expected length

3. Initialize the model from its name; the cuda function moves it to the GPU so that Colab runs the training there.

4. Resize the model's token embeddings to account for the newly added tokens.
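
## Note:
The idea behind a fixed seed can be illustrated with Python's standard random module. This is an analogy, not the torch call used in the cell below, and the helper draw is made up for the example.

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

print(draw(42) == draw(42))  # same seed, same sequence
print(draw(42) == draw(7))   # a different seed gives a different sequence
```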

In [None]:
# Set the random seed to a fixed value to get reproducible results
torch.manual_seed(42)

# Download the pre-trained GPT-Neo model's tokenizer
# Add the custom tokens denoting the beginning and the end
# of the sequence and a special token for padding
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M",
                            bos_token='<|startoftext|>',
                            eos_token='<|endoftext|>',
                            pad_token='<|pad|>')

# Download the pre-trained GPT-Neo model and transfer it to the GPU
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M").cuda()

# Resize the token embeddings because we've just added 3 new tokens
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50259, 768)

## Train/Test Split data

## Note:
Standard practice in ML is to split the input data into 3 parts:
1. train data: 70% of the total input
2. test data: 15%
3. validation data: 15%

These numbers are not a fixed rule. The ratios depend on the problem being solved and the size of the data, although the ratios above are a common choice. The code below uses a simpler split: 80% for training and 20% for validation, with no separate test set.


In [None]:
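# Subset of recipes between 220 and 260 tokens long (exclusive), capped at 1000.
# Note: the cells below build the dataset from `recipes`, not from `subset`.

subset = [x for x in recipes if 220 < len(tokenizer.encode(x)) < 260]

subset = subset[:1000]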



In [None]:

max_length = max([len(tokenizer.encode(recipe)) for recipe in recipes])
# min_length = min([len(tokenizer.encode(recipe)) for recipe in recipes])
# avg_length = sum([len(tokenizer.encode(recipe)) for recipe in recipes])/len(recipes)

# Load dataset
dataset = RecipeDataset(recipes, tokenizer, max_length)

# Split data into train/val
train_size = int(0.8 * len(dataset))

train_data, val_data = random_split(dataset, [train_size, len(dataset) - train_size])

# max_length, min_length, avg_length

In [None]:
print(f"Max sequence length : {max_length}")
print(f"training set length : {len(train_data)}")
print(f"validation set length : {len(val_data)}")


Max sequence length : 270
training set length : 801
validation set length : 201
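
## Note:
The sizes printed above follow directly from the 80/20 arithmetic in the split cell. A quick check in plain Python, where the total of 1002 recipes is inferred from the printed sizes (801 + 201) rather than read from the data:

```python
def split_sizes(n_items, train_fraction=0.8):
    """Return (train, validation) sizes the way the split cell computes them."""
    train_size = int(train_fraction * n_items)  # int() rounds down
    return train_size, n_items - train_size

print(split_sizes(1002))
```

Because int() rounds down, any leftover item goes to the validation set.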


## Tokenizer functions:

A tokenizer typically has 2 functions:
1. encode: converts the text data into tokens
2. decode: converts the tokens back into the original text form

A more detailed explanation is available at the link below:
https://huggingface.co/docs/transformers/preprocessing
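
## Note:
The encode/decode round trip can be sketched with a toy tokenizer. This is only an illustration: ToyTokenizer is a made-up class that splits on whitespace with a fixed vocabulary, whereas the real GPT-2 tokenizer works on sub-word pieces.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, words):
        self.token_to_id = {word: i for i, word in enumerate(words)}
        self.id_to_token = {i: word for word, i in self.token_to_id.items()}

    def encode(self, text):
        # Text -> list of integer ids
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        # List of integer ids -> text
        return " ".join(self.id_to_token[i] for i in ids)

toy = ToyTokenizer(["give", "me", "recipe", "for", "toast"])
ids = toy.encode("give me recipe for toast")
print(ids)
print(toy.decode(ids))
```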

In [None]:
tokenizer.batch_decode(val_data[0])

["<|startoftext|> give me recipe for Allegro'S Stuffed Green Peppers:\n Mix water bulgur onion salt and garlic powder in 2-quart casserole,\n Cover tightly and microwave on High 6 to 8 minutes or until boiling,\n stir,\n Cover and let stand until water is absorbed about 10 minutes,\n Cut thin slice from stem end of each bell pepper,\n Remove seeds and membranes,\n rinse,\n Arrange peppers cut ends up in circle in pie plate (9 x 1 1/4 or 10 x 1 1/2 inches)  Crumble ground beef into bulgur mixture,\n stir in 1 cup of the tomato sauce,\n Fill each pepper with about 1/2 cup mixture,\n Pour remaining tomato sauce over peppers,\n Cover tightly and microwave on High 12 to 14 minutes rotating pie plate 1/2 turn after 7 minutes until beef mixture is done (160 degrees Fahrenheit on meat thermometer)  Sprinkle peppers with cheese,\n let stand uncovered 5 minutes,\nNER:bulgur hot water green bell peppers cheddar cheese onion garlic powder tomato sauce ground beef salt.<|endoftext|> <|pad|> <|pad|>

## Train Model

## Note:
TrainingArguments is a class in the transformers Python module that lets you configure training with various parameters:

1. output_dir: where model checkpoints are saved
2. num_train_epochs: the number of epochs; one epoch is one complete pass of the training data through the network (forward pass plus backpropagation)
3. logging_steps: how often (in steps) information is logged to the console so you can follow the training process
4. save_steps: how often (in steps) checkpoints are saved
5. evaluation_strategy: when the model is evaluated; 'steps' means every eval_steps steps
6. eval_steps: how often (in steps) the model is evaluated during training
7. per_device_train_batch_size: batch size for the training set on each device (here a T4 GPU on Colab)
8. per_device_eval_batch_size: batch size for the evaluation set on each device (here a T4 GPU on Colab)
9. warmup_steps: number of initial steps over which the learning rate is gradually increased
10. learning_rate: the step size of each weight update; a value that is too large can make training unstable, while one that is too small makes it slow
11. weight_decay: how much weight decay (a regularization penalty on large weights) is applied during training
12. logging_dir: directory for the training logs (log information gives useful details about the training process)
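
## Note:
It helps to estimate how many training steps a run takes before choosing logging_steps, eval_steps and save_steps. The sketch below uses the numbers from this notebook (801 training examples, batch size 2, 5 epochs) and assumes a single GPU with no gradient accumulation. Both helper functions are made up for illustration, and the warm-up function only models the initial linear ramp, not the full schedule used by the Trainer.

```python
import math

def total_optimizer_steps(n_examples, batch_size, epochs):
    """Number of weight updates on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

def warmup_learning_rate(step, peak_lr, warmup_steps):
    """Learning rate during a linear warm-up (held constant afterwards in this sketch)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

steps = total_optimizer_steps(n_examples=801, batch_size=2, epochs=5)
print(steps)
print(steps // 1000)                        # evaluations with eval_steps=1000
print(warmup_learning_rate(50, 3e-5, 100))  # halfway through the warm-up
```

With roughly two thousand steps in total, eval_steps=1000 yields only two evaluation points, which is worth keeping in mind when reading the loss table.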

## Note:
Module dependency version issue: if you encounter one, the error should prompt you to run the command below. Run it and restart your runtime so that the following cells execute smoothly.

In [None]:
!pip install accelerate -U




## Note:
I wouldn't recommend running the cell below if you're just testing how this works, as it takes a significant amount of time to train the model for 3 different learning rates.

If you're comparing different learning rates for your model, then it's fine; otherwise it's not wise to run it numerous times.

Instead, run the later cell that trains the model only once, with a single learning rate.

In [None]:

# Here I will pass the output directory where
# the model predictions and checkpoints will be stored,
# batch sizes for the training and validation steps,
# and warmup_steps to gradually increase the learning rate
learning_rates = [5e-5, 3e-5, 1e-5]

for learning_rate in learning_rates:

    # Reload the pre-trained weights so that each learning rate starts
    # from the same model instead of continuing the previous run
    lr_model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M").cuda()
    lr_model.resize_token_embeddings(len(tokenizer))

    training_args = TrainingArguments(output_dir=f'./results_{learning_rate}',
                                      num_train_epochs=10,
                                      logging_steps=1000,
                                      save_steps=1000,
                                      evaluation_strategy='steps',
                                      eval_steps=1000,
                                      per_device_train_batch_size=2,
                                      per_device_eval_batch_size=2,
                                      warmup_steps=100,
                                      learning_rate=learning_rate,
                                      weight_decay=0.01,
                                      logging_dir=f'./logs_{learning_rate}')

    trainer = Trainer(model=lr_model, args=training_args,
                      train_dataset=train_data,
                      eval_dataset=val_data,
                      # This custom collate function is necessary
                      # to build batches of data
                      data_collator=lambda data: {
                          'input_ids': torch.stack([f[0] for f in data]),
                          'attention_mask': torch.stack([f[1] for f in data]),
                          'labels': torch.stack([f[0] for f in data])})

    # Start training process!
    print(f"Training result for learning rate: {learning_rate}")
    trainer.train()
    print("\n\n")

Training result for learning rate: 5e-05




Step,Training Loss,Validation Loss


KeyboardInterrupt: ignored

Based on the results above, the model trained with learning rate = 5e-5 looks more promising than the others.

## Note:
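The lower the loss, the better. Watch the validation loss in particular: if the training loss keeps falling while the validation loss stalls or rises, the model is starting to overfit.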

In [None]:
training_args = TrainingArguments(output_dir='./results',
                                  num_train_epochs=5,
                                  logging_steps=1000,
                                  save_steps=5000,
                                  evaluation_strategy='steps',
                                  eval_steps=1000,
                                  per_device_train_batch_size=2,
                                  per_device_eval_batch_size=2,
                                  warmup_steps=100,
                                  learning_rate=3e-5,
                                  weight_decay=0.01,
                                  logging_dir='./logs')

trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_data,
                  eval_dataset=val_data,
                  # This custom collate function is necessary
                  # to build batches of data
                  data_collator=lambda data: {
                      'input_ids': torch.stack([f[0] for f in data]),
                      'attention_mask': torch.stack([f[1] for f in data]),
                      'labels': torch.stack([f[0] for f in data])})

# Start training process!
trainer.train()


# Save model in the specified file path
trainer.save_model("drive/MyDrive/AICamp/models/yumyum_v2/")
tokenizer.save_pretrained("drive/MyDrive/AICamp/models/yumyum_v2")


Step,Training Loss,Validation Loss
1000,2.1315,2.220036
2000,1.7525,2.224234


('drive/MyDrive/AICamp/models/yumyum_v2/tokenizer_config.json',
 'drive/MyDrive/AICamp/models/yumyum_v2/special_tokens_map.json',
 'drive/MyDrive/AICamp/models/yumyum_v2/vocab.json',
 'drive/MyDrive/AICamp/models/yumyum_v2/merges.txt',
 'drive/MyDrive/AICamp/models/yumyum_v2/added_tokens.json')

## Checking Model Output

## Note:

return_tensors: tells the tokenizer which kind of tensor to return for the encoded tokens; 'pt' stands for PyTorch.

1. generated: the encoded prompt text, produced with the pretrained transformers tokenizer we initialized above
2. the encoded prompt is fed to the trained model, whose generate function produces some text
3. max_length: the maximum length, in tokens, of each generated sequence, including the prompt
4. num_return_sequences: the number of sequences to return as a result, in our case the number of recipes it generates
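
## Note:
The cells below also pass temperature, top_k and top_p to generate. As a rough intuition for what these sampling arguments do, here is a simplified sketch in plain Python. The function sample_next_token is made up for illustration and is not the library's implementation, whose exact filtering details may differ.

```python
import math
import random

def sample_next_token(scores, temperature=0.5, top_k=20, top_p=0.95):
    """Pick one token id from raw scores (simplified sampling sketch)."""
    # Temperature: divide the scores before softmax; values below 1
    # sharpen the distribution, values above 1 flatten it.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # top_k: keep only the k most likely tokens, then renormalise.
    ranked = ranked[:top_k]
    kept_mass = sum(p for p, _ in ranked)
    ranked = [(p / kept_mass, i) for p, i in ranked]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    survivors, cumulative = [], 0.0
    for p, i in ranked:
        survivors.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample one id in proportion to the surviving probabilities.
    ids = [i for _, i in survivors]
    weights = [p for p, _ in survivors]
    return random.choices(ids, weights=weights, k=1)[0]

print(sample_next_token([1.0, 5.0, 2.0], top_k=1))
```

Lower temperature and smaller top_k or top_p make the output more predictable; larger values make it more varied.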

In [None]:
generated = tokenizer("<|startoftext|>", return_tensors="pt").input_ids.cuda()
print(f"generated: {generated[:1]}")
sample_outputs = model.generate(generated, do_sample=True, top_k=20,
                                # bos_token='<|startoftext|>',
                                # eos_token='<|endoftext|>', pad_token='<|pad|>',
                                max_length=300, top_p=0.95, temperature=0.5, num_return_sequences=20)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))


In [None]:
# Prompt the user for input
prompt = input("Ask for a Dish>> ")

prompt = f"give me recipe for {prompt}"

# Encode the prompt using the tokenizer
encoded_prompt = tokenizer.encode(prompt, return_tensors="pt").cuda()

# Generate the output based on the prompt
sample_outputs = model.generate(encoded_prompt, do_sample=True, top_k=20,
                                max_length=300, top_p=0.95, temperature=0.5, num_return_sequences=1)

# Decode and print the generated outputs
for i, sample_output in enumerate(sample_outputs):
    decoded_output = tokenizer.decode(sample_output, skip_special_tokens=True)
    print("{}: {}".format(i, decoded_output))


Ask for a Dish>> Microwave Lasagne


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: give me recipe for Microwave Lasagne:
 Mix the milk sugar and butter in a bowl,
 Add the flour and mix well,
 Add the milk and mix well,
 Add the eggs and mix well,
 Add the flour mixture to the milk mixture and mix well,
 Pour the mixture into a bowl and stir with a fork until all the flour is added,
 Add the milk mixture to the egg mixture and mix well,
 Add the flour mixture to the milk mixture and mix well,
 Stir the dough until it is a ball,
 Divide the dough into 12 equal pieces and shape into a ball,
 Roll each piece into a circle and put in a greased baking dish,
 Cover with a towel and let rise in a warm place until doubled in size about 30 minutes,
 Bake at 350 degrees Fahrenheit for 25 minutes,
 Remove from the oven and cool on a wire rack,
NER:egg sugar flour milk butter eggs.


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"


In [None]:
! transformers-cli env

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-07-13 23:22:34.160283: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.30.2
- Platform: Linux-5.15.109+-x86_64-with-glibc2.31
- Python version: 3.10.12
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (gpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>



## Upload model to huggingface

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from huggingface_hub import HfApi

api = HfApi()

In [None]:
# Create your repo first to upload the model
api.create_repo(repo_id="yum")

RepoUrl('https://huggingface.co/callMeRover/yum', endpoint='https://huggingface.co', repo_type='model', repo_id='callMeRover/yum')

In [None]:
# Upload your model to huggingface. You can clone the repo anytime to use the model.
import os

model_pth = "drive/MyDrive/AICamp/models/yumyum_v2"

files = os.listdir(model_pth)
print(files)

for fi in files:
    print(os.path.join(model_pth, fi))

    api.upload_file(
        path_or_fileobj=os.path.join(model_pth, fi),
        path_in_repo=fi,
        repo_id="callMeRover/yum",
        repo_type="model",
    )

['config.json', 'generation_config.json', 'pytorch_model.bin', 'training_args.bin', 'tokenizer_config.json', 'special_tokens_map.json', 'added_tokens.json', 'vocab.json', 'merges.txt']
drive/MyDrive/AICamp/models/yumyum_v2/config.json
drive/MyDrive/AICamp/models/yumyum_v2/generation_config.json
drive/MyDrive/AICamp/models/yumyum_v2/pytorch_model.bin



drive/MyDrive/AICamp/models/yumyum_v2/training_args.bin



drive/MyDrive/AICamp/models/yumyum_v2/tokenizer_config.json
drive/MyDrive/AICamp/models/yumyum_v2/special_tokens_map.json
drive/MyDrive/AICamp/models/yumyum_v2/added_tokens.json
drive/MyDrive/AICamp/models/yumyum_v2/vocab.json
drive/MyDrive/AICamp/models/yumyum_v2/merges.txt


In [None]:
"""
parameters for inference api call
"""
## original
# parameters = {
#     "top_k" : 10,
#     "max_length": 100,
#     "temperature" : 0.2,
#     "top_p" : 0.22,
#     "no_repeat_ngram_size" : 3,
#     "do_sample": True,
#     }


# WORKING PARAMETERS (defined here so the next cell can use them)
parameters = {
    "top_k" : 20,
    "max_length": 300,
    "temperature" : 0.5,
    "top_p" : 0.95,
    "do_sample": True,
    }


options = {"wait_for_model": True}

In [None]:
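import os
import requests


API_URL = "https://api-inference.huggingface.co/models/callMeRover/yumYum"
# Read the access token from the environment instead of hard-coding it
# (HF_TOKEN is an assumed variable name; set it before running this cell)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}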


def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({
    "inputs": "give me recipe for Butter Cookies: ",
    "parameters": parameters,
    "options" : options
})

output

[{'generated_text': 'give me recipe for Butter Cookies: **************\n\nI use unsalted butter in this recipe to make a savory brownie. I use brown sugar in this recipe to make a savory brownie. I use shortening in this recipe to make a savory brownie. I use sour cream in this recipe to make a savory brownie.'}]

In [None]:
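import os
import requests

API_URL = "https://api-inference.huggingface.co/models/callMeRover/yum"
# Read the access token from the environment instead of hard-coding it
# (HF_TOKEN is an assumed variable name; set it before running this cell)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}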


def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

dish_name = input("Enter the dish name: ")

prompt = f"Give me recipe for {dish_name}:"

parameters = {
    "do_sample": True,
    "max_length": 400,
    "top_k": 30,
    "top_p": 0.95,
    "temperature": 0.5,
    "num_return_sequences": 1,
}


options = {"wait_for_model": True
}


output = query({
    "inputs": prompt,
    "parameters": parameters,
    "options" : options
})




In [None]:
result = output[0]['generated_text'].split('\n')
result

['Give me recipe for Misericordia Crabcakes:',
 ' Crust:',
 ' Buttercup,',
 ' In a large bowl combine flour baking powder sugar salt and cinnamon  Stir until well mixed  Add flour mixture to shell and mix well  Place crabmeat in a large bowl and cover with plastic wrap  Repeat the process with remaining ingredients  Cover and chill for 1 hour  Preheat oven to 375 degrees FahrenheitF  Bake crabcakes for 20 minutes or until golden brown and crisp  Cool on a wire rack for 5 minutes  Cool completely  Cut into 1-inch slices  Cut into 1-inch pieces and serve with whipped cream or with a small dollop of whipped cream or with a small dollop of whipped cream  Makes 4 servings,',
 'NER:sugar baking powder sugar cinnamon crabmeat flour salt.']
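
## Note:
If you want more structure than a list of lines, the generated text can be split into a title, the steps and the ingredient list. This is a minimal sketch: parse_recipe is a made-up helper, the sample text is invented, and it assumes the output follows the training layout (title line, one step per line, then a final 'NER:' line with the ingredients).

```python
def parse_recipe(generated_text):
    """Split generated text into title, steps and ingredients (assumed layout)."""
    body, _, ner = generated_text.partition("NER:")
    lines = [line.strip(" ,") for line in body.split("\n")]
    lines = [line for line in lines if line]          # drop empty lines
    title = lines[0].rstrip(":") if lines else ""
    steps = lines[1:]
    ingredients = ner.strip().rstrip(".")
    return {"title": title, "steps": steps, "ingredients": ingredients}

sample = "give me recipe for Toast:\n Toast the bread,\n Spread butter,\nNER:bread butter."
parsed = parse_recipe(sample)
print(parsed["title"])
print(parsed["steps"])
print(parsed["ingredients"])
```

Generated text does not always follow the layout, so treat missing steps or an empty ingredient string as a normal case.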

In [None]:
type(result)

list