### Instruction Tuning an LLM (Flan-T5) on custom data for Q&A purposes
### Part 1 of 2: this notebook covers tuning the LLM; the second notebook covers quantization and running inference

Every organization these days is rushing to develop its own LLM. If that is not an option for you but you want to leverage LLMs' capabilities for tasks such as Q&A, you are in the right place. In this notebook we will explore instruction tuning an LLM on our own data. More specifically, we will leverage the Q&A capabilities of LLMs to classify sentences based on a set of instructions the user provides.

But before we dive deeper, we need to understand what Instruction Tuning is all about. Briefly, Instruction Tuning is a method of training a model to perform an NLP task in a certain way by structuring the training data in the form of instructions.

For example:<br>
**Instruction: Read the Context below and answer the Question by picking from the Choices provided below<br>
Context: The Clinician said, "His brother is homeless"<br>
Question: "In the clinician's opinon, was the person themself homeless?"<br>
Choices: Yes; No<br>
Answer:** **<font color=red>No</font>**

The text highlighted in **bold** constitutes the data portion, and the text highlighted in **<font color=red>red</font>** is what is commonly referred to as the label in traditional machine learning. That is all we need to know about the data for now.
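To make the split between data and label concrete, here is a minimal sketch that turns one record into a prompt string and a label. The helper name `split_example` and the dictionary layout are assumptions made for illustration; they are not part of the notebook's own code.

```python
def split_example(row):
    """Return (prompt, label) for one record.

    The prompt is the "data" portion; the label is the expected answer.
    """
    prompt = "\n".join([
        "Instruction: " + row["Instruction"],
        "Context: " + row["Context"],
        "Question: " + row["Question"],
        "Choices: " + row["Choices"],
        "Answer:",  # left open for the model to complete
    ])
    return prompt, row["Answer"]


# Hypothetical record mirroring the example above (spelling kept as in the source data)
row = {
    "Instruction": "Read the Context below and answer the Question by picking from the Choices provided below",
    "Context": 'The Clinician said, "His brother is homeless"',
    "Question": '"In the clinician\'s opinon, was the person themself homeless?"',
    "Choices": "Yes; No",
    "Answer": "No",
}

prompt, label = split_example(row)
print(prompt)
print("label:", label)
```

The model only ever sees `prompt` as input; `label` is what it is trained to generate.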

For more information regarding the format of the prompt/context, please read this research paper: **Context-faithful Prompting for Large Language Models**: https://arxiv.org/pdf/2303.11315.pdf

You may be wondering whether it would be computationally expensive to train an LLM. It is, but we can circumvent this issue with what is referred to as Parameter-Efficient Fine-Tuning (PEFT). Using PEFT, we only need to train a small set of weights rather than the entire model.

Here is a list of the main packages used:
1. adapter-transformers (can import using "import transformers". Do not need "transformers" installed as this is an extension over the transformers library)
2. torch
3. datasets
4. tqdm
5. pandas
6. re
7. gc

We import the garbage collector module and invoke it manually to free any unreferenced objects left in memory from earlier runs

In [1]:
import gc
gc.collect()

254

My data is in a text file. If your data is already in a CSV file, please skip the next two cells<br>

In [2]:
import re  # used to extract entities from each sentence
# Read every line of the text file into a list; the file is closed automatically
with open('real_homeless_data_v1.3.txt', 'r') as data_file:
    Lines = data_file.readlines()

The data in my text file is in the same format as the example showcased at the beginning of this notebook. Using regular expressions that work for my data, I extract each field of a record into its own variable. I then append the fields of each record as a list to an outer list (a list of lists). Please print the data below if you would like to better understand what it looks like.

In [3]:
csv_list = []
# Each record spans six lines: five fields followed by a separator line
for line in range(0, len(Lines), 6):
    print(Lines[line] + Lines[line + 1] + Lines[line + 2] + Lines[line + 3] + Lines[line + 4])

    # Instruction: keep the whole line, up to and including "Choices."
    i = re.findall(r"^Instruction: .*?Choices\.", Lines[line])
    print(i)

    # Context: the quoted sentence, including the "The Clinician said," prefix
    c = re.findall(r'^Context: (The Clinician said, ".*")', Lines[line + 1])
    print(c)

    # Question: everything up to the closing ?"
    q = re.findall(r'^Question: (.*\?")', Lines[line + 2])
    print(q)

    # Choices: the list of options, which ends in "No"
    ch = re.findall(r"^Choices: (.*No)", Lines[line + 3])
    print(ch)

    # Answer: the label without its trailing period
    a = re.findall(r"^Answer: (.*)\.", Lines[line + 4])
    print(a)

    csv_list.append([i[0], c[0], q[0], ch[0], a[0]])



Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "Social History: Pt is homeless." 
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer: Yes.

['Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.']
['The Clinician said, "Social History: Pt is homeless."']
['"In the clinician\'s opinon, was the person themself homeless?"']
['Yes; No']
['Yes']
Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "Social History: He is homeless, and stays in a shelter." 
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer: Yes.

['Instruction: Read what the Clinician said in the Context below and answer the Question by ch
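The extraction step above can be tried on a single in-memory record without touching the data file. This is a minimal sketch: the `PATTERNS` dictionary and the `parse_record` helper are illustrative names, and the patterns are tailored to the record layout shown in this notebook.

```python
import re

# One record in the same layout as the text file (five fields, one per line)
record = (
    "Instruction: Read what the Clinician said in the Context below and answer "
    "the Question by choosing from the below provided Choices.\n"
    'Context: The Clinician said, "Social History: Pt is homeless." \n'
    "Question: \"In the clinician's opinon, was the person themself homeless?\"\n"
    "Choices: Yes; No\n"
    "Answer: Yes.\n"
)

# One pattern per line, in the order the lines appear in a record
PATTERNS = {
    "Instruction": r"^Instruction: .*?Choices\.",
    "Context": r'^Context: (The Clinician said, ".*")',
    "Question": r'^Question: (.*\?")',
    "Choices": r"^Choices: (.*No)",
    "Answer": r"^Answer: (.*)\.",
}


def parse_record(text):
    """Split one record into its five fields, failing loudly on a bad line."""
    fields = {}
    for line, (name, pattern) in zip(text.splitlines(), PATTERNS.items()):
        match = re.search(pattern, line)
        if match is None:
            raise ValueError(f"line does not look like a {name} line: {line!r}")
        # Use the capture group when the pattern has one, else the whole match
        fields[name] = match.group(1) if match.groups() else match.group(0)
    return fields


fields = parse_record(record)
print(fields["Context"])
print(fields["Answer"])
```

Raising an error on a non-matching line makes malformed records visible immediately, instead of surfacing later as an `IndexError` on an empty `findall` result.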

Now, I convert the data from a list into a DataFrame and save it as a CSV. If your data is already in a CSV, it is better to start here. Have a good look at how the data is structured. Each column corresponds to a unique entity of the prompt. We have separate columns for Instruction, Context, Question, Choices, and Answer. I am sure there are better ways of storing your data and using it for training, but this works for now.

In [4]:
import pandas as pd

df = pd.DataFrame (csv_list, columns = ['Instruction','Context', 'Question', 'Choices', 'Answer'])


df.to_csv("train_data.csv", sep=',')
df.head(100)

Unnamed: 0,Instruction,Context,Question,Choices,Answer
0,Instruction: Read what the Clinician said in t...,"The Clinician said, ""Social History: Pt is hom...","""In the clinician's opinon, was the person the...",Yes; No,Yes
1,Instruction: Read what the Clinician said in t...,"The Clinician said, ""Social History: He is hom...","""In the clinician's opinon, was the person the...",Yes; No,Yes
2,Instruction: Read what the Clinician said in t...,"The Clinician said, ""He is homeless and curren...","""In the clinician's opinon, was the person the...",Yes; No,Yes
3,Instruction: Read what the Clinician said in t...,"The Clinician said, ""The patient is currently ...","""In the clinician's opinon, was the person the...",Yes; No,Yes
4,Instruction: Read what the Clinician said in t...,"The Clinician said, ""previously from , moved ...","""In the clinician's opinon, was the person the...",Yes; No,No
5,Instruction: Read what the Clinician said in t...,"The Clinician said, ""The patient was born in ...","""In the clinician's opinon, was the person the...",Yes; No,Yes
6,Instruction: Read what the Clinician said in t...,"The Clinician said, "" Mr. is homeless and li...","""In the clinician's opinon, was the person the...",Yes; No,Yes
7,Instruction: Read what the Clinician said in t...,"The Clinician said, ""Social History: homeless;...","""In the clinician's opinon, was the person the...",Yes; No,Yes
8,Instruction: Read what the Clinician said in t...,"The Clinician said, ""He is now homeless and wa...","""In the clinician's opinon, was the person the...",Yes; No,Yes
9,Instruction: Read what the Clinician said in t...,"The Clinician said, ""Currently homeless, but f...","""In the clinician's opinon, was the person the...",Yes; No,Yes


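Removing duplicates in the dataframe based on the Context column. Note that train_data.csv was written in the previous cell, before this step, so run df.to_csv again afterwards if the saved file should also be de-duplicated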

In [5]:
df = df.drop_duplicates(subset = "Context")
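For readers not working in pandas, the same idea can be sketched in plain Python: keep the first row seen for each distinct Context and drop later repeats, which is also what `drop_duplicates` does by default. The function name and the sample rows below are made up for illustration.

```python
def drop_duplicate_contexts(rows):
    """Keep only the first row for each distinct Context value."""
    seen = set()
    unique = []
    for row in rows:
        if row["Context"] not in seen:
            seen.add(row["Context"])
            unique.append(row)
    return unique


# Hypothetical rows: the third repeats the Context of the first
rows = [
    {"Context": "Pt is homeless.", "Answer": "Yes"},
    {"Context": "His brother is homeless.", "Answer": "No"},
    {"Context": "Pt is homeless.", "Answer": "Yes"},
]

deduped = drop_duplicate_contexts(rows)
print(len(rows), "->", len(deduped))
```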

Importing Model from Hugging Face Hub. For this example, we will go with google/flan-t5-xl which should be under 15 GB<br>

In [6]:
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
import os
from transformers import DataCollatorForSeq2Seq

In [7]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The model has approximately 2.85 billion parameters, as shown by the code in the cell below

In [8]:
pytorch_total_params = sum(p.numel() for p in model.parameters())
print("Total Parameters: ", pytorch_total_params) 

Total Parameters:  2849757184


However, with the PEFT methodology we need not train all the parameters of the model. We can train a separate lightweight module and then merge its weights with the weights of the original model. <br><br>
For our example, we will be using the (IA)<sup>3</sup> method for tuning the model. 
To learn more about this methodology in particular, please take a look at this paper: https://arxiv.org/abs/2205.05638
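As a rough intuition for what (IA)<sup>3</sup> trains: the pretrained projection stays frozen, and a small learned vector rescales its output element-wise. The sketch below uses plain Python lists and made-up numbers; it illustrates the idea from the paper and is not the adapter-transformers implementation.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ia3_linear(W, scale, x):
    """Frozen projection W, followed by element-wise rescaling with the learned vector."""
    return [s * y for s, y in zip(scale, matvec(W, x))]


# Frozen pretrained weights (3 outputs, 2 inputs); never updated during tuning
W = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 1.0]]
x = [1.0, 1.0]

# The scaling vector starts at all ones, so the model's behaviour is unchanged at first
scale_init = [1.0, 1.0, 1.0]
assert ia3_linear(W, scale_init, x) == matvec(W, x)

# After training, only the scaling vector has moved (values here are invented)
scale_trained = [0.5, 2.0, 1.0]
adapted = ia3_linear(W, scale_trained, x)
print(adapted)

# Trainable values: one per output feature, versus rows * columns for W itself
print(len(scale_trained), "trainable vs", len(W) * len(W[0]), "frozen")
```

Because only one value per output feature is learned, the number of trainable parameters grows with the layer width rather than with the size of the weight matrix.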

The implementation of (IA)<sup>3</sup> is present as a part of the adapter-transformers package. It is an extension over the transformers package along with PEFT tuning methods. If your device does not recognise this package even after installation, just make sure to uninstall the package transformers and only keep adapter-transformers in your environment

We will need to grab the (IA)<sup>3</sup> config and pass it as a parameter to the add_adapter function. The name of the adapter can be anything you like really.

In [9]:
from transformers.adapters import IA3Config

config = IA3Config()
model.add_adapter("ia3_adapter", config = config)

Now that we have added an adapter, we need to tell the model which adapter to train. model.train_adapter() freezes the pretrained model's weights and activates the adapter, so that only the separate lightweight module is trained. model.train() then puts the model into training mode

In [10]:
model.train_adapter("ia3_adapter")
model.train()

T5ForConditionalGeneration(
  (shared_parameters): ModuleDict()
  (shared): Embedding(32128, 2048)
  (encoder): T5Stack(
    (invertible_adapters): ModuleDict()
    (embed_tokens): Embedding(32128, 2048)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(
                in_features=2048, out_features=2048, bias=False
                (loras): ModuleDict()
              )
              (k): Linear(
                in_features=2048, out_features=2048, bias=False
                (loras): ModuleDict(
                  (ia3_adapter): LoRA()
                )
              )
              (v): Linear(
                in_features=2048, out_features=2048, bias=False
                (loras): ModuleDict(
                  (ia3_adapter): LoRA()
                )
              )
              (o): Linear(in_features=2048, out_features=2048, bias=False)
              (rela

As you can see in the cell below, we check for the number of parameters which require their gradients to be computed. Lo and behold, we see that we only have to train 540k parameters which is a huge drop from around 3B parameters

In [11]:
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total Parameters: ", pytorch_total_params) 

Total Parameters:  540672
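One way to account for the 540672 figure is to count one scaling value per output feature of each rescaled layer. The sketch below assumes that (IA)<sup>3</sup> rescales the key and value projections of every attention layer plus one feed-forward activation per block, and that flan-t5-xl has 24 encoder blocks, 24 decoder blocks, and a feed-forward width of 5120; those architecture numbers are assumptions, as only the hidden size 2048 is visible in the printed model above.

```python
d_model = 2048   # hidden size, as shown in the printed Linear layers
d_ff = 5120      # assumed feed-forward width of flan-t5-xl
n_blocks = 24    # assumed number of encoder blocks, and likewise of decoder blocks

# Encoder block: key + value in self-attention, plus the feed-forward activation
per_encoder_block = 2 * d_model + d_ff
# Decoder block: self-attention and cross-attention (key + value each), plus feed-forward
per_decoder_block = 2 * d_model + 2 * d_model + d_ff

trainable = n_blocks * (per_encoder_block + per_decoder_block)
total = 2849757184  # parameter count printed earlier in this notebook

print("trainable:", trainable)
print(f"share of the full model: {100 * trainable / total:.4f}%")
```

Under these assumptions the count comes out to exactly the number reported by the cell above, which is roughly 0.019% of the full model.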


Since we are going to tune the model using the Hugging Face Trainer, let us get our training data (stored in train_data.csv) into a "datasets" object. datasets is a package offered by Hugging Face for structuring and storing data before training

In [12]:
from datasets import load_dataset

dataset = load_dataset('csv', data_files = { 
        "train": "train_data.csv"
        })

Downloading and preparing dataset csv/default to C:/Users/JkReddy/.cache/huggingface/datasets/csv/default-bbfcff9c7623a4c3/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to C:/Users/JkReddy/.cache/huggingface/datasets/csv/default-bbfcff9c7623a4c3/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [13]:
dataset = dataset.remove_columns("Unnamed: 0")
dataset

DatasetDict({
    train: Dataset({
        features: ['Instruction', 'Context', 'Question', 'Choices', 'Answer'],
        num_rows: 33
    })
})

Before I explain the next two cells, let's take one more look at how our data is structured

**Instruction: Read the Context below and answer the Question by picking from the Choices provided below<br>
Context: The Clinician said, "His brother is homeless"<br>
Question: "In the clinician's opinon, was the person themself homeless?"<br>
Choices: Yes; No<br>
Answer:** **<font color=red>No</font>**

We have each of these sentences in separate columns as per best practices. But it would ideally be better if we could store the **bold** portion in one column and **<font color=red>No</font>** portion in another column before we pass everything to the Trainer. 

So, in the two cells below, we concatenate the data and get it all into two separate columns. Please feel free to perform whatever operations you are comfortable with to get the data into the structure mentioned above. I will update this code later to simplify this portion. It definitely feels over-engineered to me.

In [14]:
template = "Question: \"In the clinician's opinon, was the person themself homeless?\"\nChoices: Yes; No"
instruction = "Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices."

In [15]:
def add_template(examples):
    # Wrap each raw context in the full prompt:
    # instruction, context, question, choices, and an open "Answer:" slot
    prompts = []
    for ex in examples['Context']:
        prompt = instruction + "\nContext: " + ex + "\n" + template + "\nAnswer:"
        print(prompt, end = "\n\n")
        prompts.append(prompt)
    # Returning the column by name overwrites 'Context' in the mapped dataset
    return {'Context': prompts}

dataset = dataset.map(add_template, batched = True)
dataset

Map:   0%|          | 0/33 [00:00<?, ? examples/s]

Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "Social History: Pt is homeless."
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer:

Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "Social History: He is homeless, and stays in a shelter."
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer:

Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.
Context: The Clinician said, "He is homeless and currently staying with friends."
Question: "In the clinician's opinon, was the person themself homeless?"
Choices: Yes; No
Answer:

Instruction: Read what the Clinician said in the Context below and answer t

DatasetDict({
    train: Dataset({
        features: ['Instruction', 'Context', 'Question', 'Choices', 'Answer'],
        num_rows: 33
    })
})

In the next two cells, we are going to be tokenizing the "Data" (Context column) and "Label" (Answer column) 

In [16]:
def preprocess_function(examples):
    inputs = [ex for ex in examples["Context"]]
    targets = [ex for ex in examples["Answer"]]
    print(inputs)
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length= 512, truncation=True
    )
    return model_inputs

In [17]:
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

Map:   0%|          | 0/33 [00:00<?, ? examples/s]

['Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.\nContext: The Clinician said, "Social History: Pt is homeless."\nQuestion: "In the clinician\'s opinon, was the person themself homeless?"\nChoices: Yes; No\nAnswer:', 'Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.\nContext: The Clinician said, "Social History: He is homeless, and stays in a shelter."\nQuestion: "In the clinician\'s opinon, was the person themself homeless?"\nChoices: Yes; No\nAnswer:', 'Instruction: Read what the Clinician said in the Context below and answer the Question by choosing from the below provided Choices.\nContext: The Clinician said, "He is homeless and currently staying with friends."\nQuestion: "In the clinician\'s opinon, was the person themself homeless?"\nChoices: Yes; No\nAnswer:', 'Instruction: Read what the Clinician said in the Con

Data Collators take care of padding the entities in our dataset and getting it ready for prime time. <br>
Why do we need padding you ask? Please read this article for more details: https://medium.com/@canerkilinc/padding-for-nlp-7dd8598c916a

In [18]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
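To see what padding does to a batch, here is a minimal plain-Python sketch of the idea. It is not the DataCollatorForSeq2Seq implementation; the token ids are invented, and the helper name `pad_batch` is hypothetical. The two constants reflect that T5 uses 0 as its pad token id and that label positions set to -100 are ignored by the loss.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(
            f["labels"] + [LABEL_PAD_ID] * (max_lab - len(f["labels"]))
        )
    return batch


# Two tokenized examples of different lengths (token ids are made up)
features = [
    {"input_ids": [11, 12, 13, 14, 1], "labels": [21, 1]},
    {"input_ids": [11, 15, 1], "labels": [22, 23, 1]},
]

batch = pad_batch(features)
for name, rows in batch.items():
    print(name, rows)
```

Every row in the padded batch has the same length, which is what allows the examples to be stacked into a single tensor.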

We use the Hugging Face Trainer to train the model. The training arguments are set in the cell below. Please feel free to read more about the best parameters for the model you are going to train. It is generally good to have warmup steps (here 10% of the total steps), as this tends to result in a better model. Weight decay can also be set to a moderate 0.05 to reduce overfitting. I will update this notebook later with a detailed explanation of each of these parameters. We are not going to use a GPU for this example, but if you do want to use your Nvidia GPU with CUDA cores, set no_cuda to False. Let us run the model for 5 epochs. The output directory of the trained model will be "tuned_model/flan-t5-xl_demo"

In [19]:
from transformers import Seq2SeqTrainingArguments
epochs = 5
# With a batch size of 1, there is one optimizer step per training example
total_steps = len(dataset["train"]) * epochs
args = Seq2SeqTrainingArguments(
    "tuned_model/flan-t5-xl_demo",
#     evaluation_strategy="epoch",
#     save_strategy = "epoch",
    learning_rate = 3e-3,
#     logging_steps = 1,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    weight_decay=0.05,
    num_train_epochs= epochs,
    predict_with_generate=True,
#     load_best_model_at_end = True,
#     metric_for_best_model = "eval_loss",
    greater_is_better = False,
    logging_strategy = "epoch",
    no_cuda = True,
#     save_total_limit = 2,
    warmup_steps = total_steps // 10,  # integer step count, 10% of the total
    max_steps = total_steps
#     fp16=True,
#     fp16_full_eval = True
)
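To get a feel for what the warmup setting does, the sketch below computes the learning rate at a few steps under linear warmup followed by linear decay, which is the Trainer's default schedule type. The function `linear_warmup_lr` is an illustrative stand-in rather than the library's own code, and the warmup length is rounded down to a whole number of steps.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


num_examples, epochs, batch_size = 33, 5, 1
total_steps = num_examples * epochs // batch_size  # one step per example per epoch
warmup_steps = total_steps // 10                   # 10% of the run, as a whole number
peak_lr = 3e-3

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_lr(step, peak_lr, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {lr:.6f}")
```

The learning rate starts at zero, reaches its peak at the end of warmup, and falls back to zero on the final step, so the earliest updates are small.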

Creating the Trainer object which takes in the model, arguments which we previously initialized, data, data collator and tokenizer

In [20]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
#     eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
#     callbacks=[early_stop],
)

max_steps is given, it will override any value given in num_train_epochs


Running the Trainer

In [21]:
trainer.train()

***** Running training *****
  Num examples = 33
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 165
  Number of trainable parameters = 540672


Step,Training Loss
33,0.2883
66,0.1525
99,0.1087
132,0.0938
165,0.0447




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=165, training_loss=0.13760994492155132, metrics={'train_runtime': 856.0706, 'train_samples_per_second': 0.193, 'train_steps_per_second': 0.193, 'total_flos': 247263392563200.0, 'train_loss': 0.13760994492155132, 'epoch': 5.0})

Let us now merge the weights of the adapter into the model. After merging, the adapter is presumably no longer needed as a separate module and could be deleted

In [22]:
trainer.model.merge_adapter("ia3_adapter")
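Conceptually, merging an (IA)<sup>3</sup> adapter means folding its learned scaling vector into the frozen weight matrix, so that the merged layer gives the same output with no extra module at inference time. The plain-Python sketch below shows that equivalence with made-up numbers; it illustrates the idea and is not the library's implementation.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def merge_scaling(W, scale):
    """Fold a per-output scaling vector into W by scaling each row."""
    return [[s * w for w in row] for s, row in zip(scale, W)]


W = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 1.0]]            # frozen pretrained weights (invented values)
scale = [0.5, 2.0, 1.0]     # learned scaling vector (invented values)
x = [1.0, 1.0]

# Adapter kept separate: project with W, then rescale the output
separate = [s * y for s, y in zip(scale, matvec(W, x))]

# Adapter merged: a single projection with the rescaled weights
W_merged = merge_scaling(W, scale)
merged = matvec(W_merged, x)

assert all(math.isclose(a, b) for a, b in zip(separate, merged))
print(merged)
```

Since the merged matrix has the same shape as the original, the saved model has the same architecture as plain flan-t5-xl.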

Saving the final model in the folder mentioned in the arguments above

In [23]:
trainer.save_model()

Saving model checkpoint to tuned_model/flan-t5-xl_demo
Configuration saved in tuned_model/flan-t5-xl_demo\config.json
Configuration saved in tuned_model/flan-t5-xl_demo\generation_config.json
The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at tuned_model/flan-t5-xl_demo\pytorch_model.bin.index.json.
tokenizer config file saved in tuned_model/flan-t5-xl_demo\tokenizer_config.json
Special tokens file saved in tuned_model/flan-t5-xl_demo\special_tokens_map.json


Quantization and inference are covered in the next notebook:<br>
(Part 2)