<a href="https://colab.research.google.com/github/manikandannp/IIScRepo/blob/main/Finetune_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A Program by IISc and TalentSprint
### Assignment 1: Fine-tune GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* load and pre-process data from a text file
* load and use a pre-trained tokenizer
* finetune a GPT-2 language model from Hugging Face's `transformers` library

## Dataset Description

The text data file is taken from the Project Gutenberg eBook "***The Buddha's Path of Virtue: A Translation of the Dhammapada***" by F. L. Woodward, available [here](https://www.gutenberg.org/files/35185/35185-h/35185-h.htm).

To know more about Project Gutenberg's eBooks, refer [here](https://www.gutenberg.org/).

### **GPT-2**

OpenAI's GPT-2 showed an impressive ability to write coherent and passionate essays, exceeding what language models of the time could produce. GPT-2 was not a particularly novel architecture: it is very similar to the **decoder-only transformer**. It was, however, a very large transformer-based language model trained on a massive dataset.

Here, we are going to fine-tune the GPT-2 model on the text of the Project Gutenberg eBook The Buddha's Path of Virtue. After fine-tuning, we can expect the model to respond to prompts related to the subject matter of this book.

To know more about GPT-2, refer [here](http://jalammar.github.io/illustrated-gpt2/).
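
The defining feature of a decoder-only transformer is causal (left-to-right) self-attention: each position may attend only to itself and to earlier positions. A minimal sketch of such a mask in plain Python follows; the helper name `causal_mask` is illustrative and not part of any library.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1
    if position i is allowed to attend to position j, else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

# Each row shows what one position can "see": only itself and the past.
mask = causal_mask(4)
for row in mask:
    print(row)
```

Because every row hides the future positions, the model can be trained to predict the next token at every position of a sequence in a single pass.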

### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2237883" #@param {type:"string"}

In [2]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "9986653311" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M6_AST_01_Finetune_GPT2_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    ipython.magic("sx pip install -U accelerate")
    ipython.magic("sx pip install -U transformers")
    ipython.magic("sx pip install torch")
    ipython.magic("sx wget https://www.gutenberg.org/files/35185/35185-0.txt")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Importing required packages

In [4]:
import os
import re
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import pipeline, set_seed, GPT2Model
import warnings
warnings.filterwarnings('ignore')

### Load the data

The data is in a text file (.txt)

Create functions to read text files:

In [5]:
# Function to read a plain-text file

def read_txt(file_path):
    # Read the whole file into one string; UTF-8 is assumed for the input files
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

In [7]:
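# Read files/documents

# The Project Gutenberg text fetched during setup would be read from:
# file_path = '/content/35185-0.txt'

# This run uses separate training and validation files instead
train_path = '/content/agreement train.txt'
train_file = read_txt(train_path)
val_path = '/content/agreement validation.txt'
test_file = read_txt(val_path)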

In [None]:
print(train_file)

In [None]:
print(test_file)

### Pre-processing

- Remove any excess newline characters from the text

### Split the text into training and validation sets

In [8]:
# Remove excess newline characters
train_file = re.sub(r'\n+', '\n', train_file).strip()
#print(train_file)

# Keep the first `train_fraction` of the text for training.
# With train_fraction = 1.0 the whole file is used, since validation data comes from a separate file.
train_fraction = 1.0
split_index = int(train_fraction * len(train_file))
train_text = train_file[:split_index]

In [9]:
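# Remove excess newline characters from the validation text
test_file = re.sub(r'\n+', '\n', test_file).strip()
#print(test_file)

# Keep the first `val_fraction` of the text for validation (1.0 keeps the whole file)
val_fraction = 1.0
split_index = int(val_fraction * len(test_file))
val_text = test_file[:split_index]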
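
The pre-processing above does two things: it collapses runs of newlines and it keeps a leading fraction of the text. The sketch below shows both steps on a made-up string; the helper names `collapse_newlines` and `split_by_fraction` are illustrative and do not appear in the notebook.

```python
import re

def collapse_newlines(text):
    # Replace every run of one or more newlines with a single newline,
    # then trim leading and trailing whitespace.
    return re.sub(r'\n+', '\n', text).strip()

def split_by_fraction(text, train_fraction):
    # Character-level split: the first train_fraction of the text goes to
    # training, the remainder to validation.
    split_index = int(train_fraction * len(text))
    return text[:split_index], text[split_index:]

sample = "Line one\n\n\nLine two\n\nLine three\n"
cleaned = collapse_newlines(sample)
train_part, val_part = split_by_fraction(cleaned, 0.5)
print(repr(cleaned))
print(repr(train_part), repr(val_part))
```

With a fraction of 1.0, as used in this notebook, the second part is empty, which is why the validation text is read from its own file.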

In [10]:
# Save the training and validation data as UTF-8 text files
with open("train.txt", "w", encoding="utf-8") as t:
    t.write(train_text)

with open("val.txt", "w", encoding="utf-8") as v:
    v.write(val_text)

### Load pre-trained tokenizer - GPT2Tokenizer

The GPT2Tokenizer is based on ***Byte-Pair-Encoding***.

Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model.

In BPE, new tokens are added until the desired vocabulary size is reached by learning ***merges***, which are rules to merge two elements of the existing vocabulary together into a new one.

The figure below shows how the vocabulary is updated as the BPE algorithm progresses.

<br>
<center>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/Byte-pair-encoding.png" width=450px>
</center>

To know more about Byte-Pair Encoding, refer [here](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt#byte-pair-encoding-tokenization).
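
The merge-learning loop described above can be sketched in a few lines of plain Python. This is a toy, character-level version meant only to illustrate the idea; the word frequencies are made up, and the function names are illustrative rather than taken from any library.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Rewrite every word so that each occurrence of `pair` becomes one symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_word_freqs, num_merges):
    # Start from single characters and repeatedly merge the most frequent pair.
    word_freqs = {tuple(word): freq for word, freq in corpus_word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges

toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = learn_bpe(toy_corpus, 3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each learned merge adds one new token to the vocabulary, so the number of merges controls the final vocabulary size.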

<br>

Some of the parameters required to create a `GPT2Tokenizer` include:

- ***vocab_file (str):*** path to the vocabulary JSON file, which maps tokens to integer ids

- ***merges_file (str):*** path to the ***merges*** file, which contains the merge rules, one per line. Each merge rule consists of the two entities to merge, separated by a space.
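
To make the merges-file format concrete, here is a hedged sketch of how such rules could be applied to a single word. It assumes that rules earlier in the file take priority, uses three made-up rules, and ignores the byte-level details of the real GPT-2 tokenizer; `parse_merges` and `apply_merges` are hypothetical helpers.

```python
def parse_merges(lines):
    # One rule per line, two entries separated by a space.
    # The line position is used as the rule's priority (lower = applied first).
    rules = [tuple(line.split(" ")) for line in lines if line.strip()]
    return {pair: rank for rank, pair in enumerate(rules)}

def apply_merges(word, ranks):
    # Start from single characters and keep applying the best-ranked rule
    # that matches an adjacent pair, until no rule applies.
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in ranks]
        if not candidates:
            break
        best = min(candidates, key=ranks.get)
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merge_lines = ["u g", "u n", "h ug"]   # toy contents of a merges file
ranks = parse_merges(merge_lines)
print(apply_merges("hugs", ranks))  # ['hug', 's']
print(apply_merges("pun", ranks))   # ['p', 'un']
```

The resulting symbols are then looked up in the vocabulary file to obtain integer ids.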



Here, we will instantiate a GPT-2 tokenizer from a predefined tokenizer using the `from_pretrained()` method.

It takes the following parameter:

- ***pretrained_model_name_or_path:*** It can be a string of a predefined tokenizer hosted inside a model repo on huggingface.co.

    For example: *gpt2, gpt2-medium, gpt2-large, or gpt2-xl*

    This will download the corresponding vocab, merges, and config files.

In [11]:
# Set up the tokenizer
set_seed(42)

# Available checkpoints (keep exactly one uncommented):
# checkpoint = "gpt2"         # 124M parameters
# checkpoint = "gpt2-medium"  # 355M parameters
# checkpoint = "gpt2-large"   # 774M parameters
checkpoint = "gpt2-xl"        # 1.5B parameters
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

In [12]:
# Tokenize sample text using GPT2Tokenizer
sample_ids = tokenizer("Hello world")
sample_ids

{'input_ids': [15496, 995], 'attention_mask': [1, 1]}

In [13]:
# Generate tokens for sample text
sample_tokens = tokenizer.convert_ids_to_tokens(sample_ids['input_ids'])
sample_tokens

['Hello', 'Ġworld']

In [14]:
# Generate original text back
tokenizer.convert_tokens_to_string(sample_tokens)

'Hello world'
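
The `Ġ` in `'Ġworld'` stands for the leading space. GPT-2's BPE works on bytes, and every byte is first mapped to a visible character so that spaces and control bytes can appear inside tokens. The sketch below reconstructs that byte-to-character table as it is commonly described for GPT-2; treat it as an illustration rather than the library's exact code.

```python
def bytes_to_unicode():
    # Bytes that are already printable characters keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (space, control bytes, ...) is moved to 256, 257, ...
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

byte_encoder = bytes_to_unicode()
visible = "".join(byte_encoder[b] for b in " world".encode("utf-8"))
print(visible)  # Ġworld
```

`convert_tokens_to_string()` reverses this mapping, which is how `'Ġworld'` becomes `' world'` again.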

### Tokenize text data

In [15]:
# Tokenize train text
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)

# Tokenize validation text
val_dataset = TextDataset(tokenizer=tokenizer, file_path="val.txt", block_size=128)

In [16]:
# Length of train and validation set
len(train_dataset), len(val_dataset)

(38, 9)

In [17]:
# Shape of a single example (block_size tokens each)
train_dataset[0].shape, val_dataset[0].shape

(torch.Size([128]), torch.Size([128]))
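
Each example has exactly 128 token ids because the tokenized text is cut into consecutive blocks of `block_size`. The following is a minimal sketch of that chunking, under the assumption that a trailing partial block is discarded; `chunk_token_ids` is a hypothetical helper, not a `transformers` function.

```python
def chunk_token_ids(token_ids, block_size):
    # Step through the ids in strides of block_size and stop before a
    # partial block, so every example has exactly block_size tokens.
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

toy_ids = list(range(10))          # stand-in for a tokenized document
blocks = chunk_token_ids(toy_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under the same assumption, 38 training examples of 128 tokens correspond to at least 38 × 128 = 4,864 training tokens.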

### Data Collator

Data collators are objects that:

- will form a batch by using a list of dataset elements as input
- may apply some processing (like padding)

One of the data collators, `DataCollatorForLanguageModeling`, can also apply some random data augmentation (like random masking) on the formed batch.

<br>

`DataCollatorForLanguageModeling` is a data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:

- ***tokenizer:*** The tokenizer used for encoding the data.
- ***mlm*** (bool, optional, default=True): Whether or not to use masked language modeling.
    - If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100).
    - Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- ***return_tensors*** (str): The type of Tensor to return. Allowable values are “np”, “pt” and “tf” for numpy array, pytorch tensor, and tensorflow tensor respectively.

To know more about `DataCollatorForLanguageModeling` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
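
To illustrate what `mlm=False` means for the labels, here is a simplified, list-based sketch of a causal-language-modeling collator. It is not the library implementation: `collate_causal_lm` is a hypothetical helper, and the pad id of 0 is chosen only for the example.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def collate_causal_lm(examples, pad_token_id):
    # Pad every example to the longest one in the batch.
    max_len = max(len(ex) for ex in examples)
    input_ids, attention_mask, labels = [], [], []
    for ex in examples:
        n_pad = max_len - len(ex)
        input_ids.append(ex + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ex) + [0] * n_pad)
        # Labels copy the inputs; padded positions are ignored by the loss.
        labels.append(ex + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

The labels are not shifted here; the causal language model shifts them internally so that each position is scored on predicting the next token. In this notebook every example already has 128 tokens, so no padding is actually needed.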

In [18]:
# Create a Data collator object
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")

### Load pre-trained Model

***GPT2LMHeadModel*** is the GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

This model is a PyTorch `torch.nn.Module` subclass which can be used as a regular PyTorch Module.

Parameters:

- ***config (GPT2Config):*** Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

Here, we will instantiate a pretrained PyTorch model with the `from_pretrained()` method, which loads both the model configuration and the weights associated with the model.
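
"Weights tied to the input embeddings" means that the language modeling head reuses the token-embedding matrix: the logit for each vocabulary entry is the dot product of the final hidden state with that token's embedding. A toy sketch with made-up numbers (a 3-token vocabulary and 2-dimensional embeddings, both hypothetical):

```python
def lm_head_logits(hidden_state, embedding_matrix):
    # One logit per vocabulary entry: dot product of the hidden state
    # with that token's embedding row (the same matrix used for input lookup).
    return [sum(h * e for h, e in zip(hidden_state, row)) for row in embedding_matrix]

embedding_matrix = [
    [1.0, 0.0],   # embedding of token 0
    [0.0, 1.0],   # embedding of token 1
    [1.0, 1.0],   # embedding of token 2
]
hidden_state = [0.5, 2.0]
logits = lm_head_logits(hidden_state, embedding_matrix)
print(logits)  # [0.5, 2.0, 2.5]
```

Sharing the matrix this way means the output layer adds no extra parameters beyond the embeddings.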

In [19]:
# Set up the model
# GPT2LMHeadModel adds the language modeling head that Trainer needs to compute the loss
# and that generate() needs to produce text; the bare GPT2Model has neither.
model = GPT2LMHeadModel.from_pretrained(checkpoint)    # also try gpt2, gpt2-medium, gpt2-large and gpt2-xl

**Note: Approximate training times for different GPT-2 models on a GPU for this dataset are as follows:**

* **GPT-2: ~20 minutes for 100 epochs**

* **GPT-2 Medium: ~1 hour for 100 epochs**

* **GPT-2 Large: runs out of memory**

### Fine-tune Model

Train a GPT-2 model using the provided training arguments. Save the resulting trained model and tokenizer to a specified output directory.

The `Trainer` class provides an API for feature-complete training in PyTorch for most standard use cases.

Before instantiating your `Trainer`, create a `TrainingArguments` object to access all the points of customization during training.

`TrainingArguments` parameters:

- ***output_dir*** (str): The output directory where the model predictions and checkpoints will be written.
- ***overwrite_output_dir*** (bool, optional, default=False): If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
- ***per_device_train_batch_size*** (int, optional, default=8): The batch size per GPU/TPU/MPS/NPU core/CPU for training.
- ***per_device_eval_batch_size*** (int, optional, default=8): The batch size per GPU/TPU/MPS/NPU core/CPU for evaluation.
- ***save_total_limit*** (int, optional): If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir.

To know more about `TrainingArguments` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/trainer#transformers.TrainingArguments).

To know more about `Trainer` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/trainer#transformers.Trainer).

In [19]:
# Set up the training arguments

model_output_path = "/content/gpt_model"

training_args = TrainingArguments(
    output_dir = model_output_path,
    overwrite_output_dir = True,
    per_device_train_batch_size = 4, # try with 2
    per_device_eval_batch_size = 4,  #  try with 2
    num_train_epochs = 100,
    save_steps = 1_000,
    save_total_limit = 2,
    logging_dir = './logs',
    )
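
As a quick sanity check on these settings, the number of optimizer steps can be estimated from the dataset size reported earlier (38 training examples), the batch size and the epoch count. The sketch assumes a single device and no gradient accumulation; `steps_per_epoch` is an illustrative helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # The last batch of an epoch may be smaller, so round up.
    return math.ceil(num_examples / batch_size)

num_examples = 38      # len(train_dataset) reported above
batch_size = 4         # per_device_train_batch_size
num_epochs = 100       # num_train_epochs

total_steps = steps_per_epoch(num_examples, batch_size) * num_epochs
print(total_steps)  # 1000
```

Under those assumptions, `save_steps = 1_000` writes its first checkpoint only at the very end of training.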

In [20]:
# Train the model
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
)
trainer.train()

# Save the model
trainer.save_model(model_output_path)
# Save the tokenizer
tokenizer.save_pretrained(model_output_path)

| Step | Training Loss |
|------|---------------|
| 500  | 0.238         |


Using pad_token, but it is not set yet.


('/content/gpt_model/tokenizer_config.json',
 '/content/gpt_model/special_tokens_map.json',
 '/content/gpt_model/vocab.json',
 '/content/gpt_model/merges.txt',
 '/content/gpt_model/added_tokens.json')

### Test Model with user input prompts

##### Now, let us test the model with some prompts


The `generate_response()` function takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model.

In [21]:
def generate_response(model, tokenizer, prompt, max_length=100):

    input_ids = tokenizer.encode(prompt, return_tensors="pt")      # 'pt' for returning pytorch tensor

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)

In [22]:
# Load the fine-tuned model and tokenizer

my_model = GPT2LMHeadModel.from_pretrained(model_output_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [34]:
# Testing with given prompt 1
prompt = "what defines WITNESS in the agreement?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: what defines WITNESS in the agreement?
A.
B.
C.
D.
E.
F.
G.
H.
                            
                                      


In [36]:
# Testing with given prompt 2
prompt = "What are the people mentioned in the agreement?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt, max_length=150)
print("Generated response:", response)

Generated response: What are the people mentioned in the agreement?
A:
The SELLER has paid a sum of Rs.­­­­­­­­­­­­(Rupees  only) by cash/ cheque /D.D. bearing No  drawn on  dated as advance, the receipt of which sum the SELLER hereby acknowledges.
The balance payment of Rs.(Rupees  only) will be paid by the PURCHASER to the SELLER at the time of execution of the absolute Sale Deed and thus completing the Sale transaction.
The parties herein covenant to complete the Sale transaction and to execute the Absolute Sale Deed by the end of
The SELLER confirms with the PURCH


In [30]:
# Testing with given prompt 3
#Sample1: Who is the SELLER for purchaser Mrs. UMA P ? Answer: Mr. MANIKANDAN s/o. PURUSHOTHAMAN
#Sample2: Who is the SELLER for purchaser Mr. BENGALURU ? Answer: Mr. KANNAN s/o. MANIKAM
#Sample3: Who is the SELLER for purchaser Mr. MANIKANDANNNP ? Answer: Mr. SHANMUGAM s/o. NAIKUM
#Sample4: Who is the SELLER for purchaser Mr. AMMANJI ? Answer:

prompt = """
Extract answer only from the document.
Sample1: Who is the SELLER for purchaser Mrs. UMA P ? Answer: Mr. MANIKANDAN s/o. PURUSHOTHAMAN
Sample2: Who is the SELLER for purchaser Mr. BENGALURU ? Answer: Mr. KANNAN s/o. MANIKAM
Sample3: Who is the SELLER for purchaser Mr. MANIKANDANNNP ? Answer: Mr. SHANMUGAM s/o. NAIKUM
Sample4: Who is the SELLER for purchaser Mr. AMMANJI ? Answer:
"""

response = generate_response(my_model, my_tokenizer, prompt, max_length=150)
print("Generated response:", response)

Generated response: 
Extract answer only from the document.
Sample1: Who is the SELLER for purchaser Mrs. UMA P? Answer: Mr. MANIKANDAN s/o. PURUSHOTHAMAN
Sample2: Who is the SELLER for purchaser Mr. BENGALURU? Answer: Mr. KANNAN s/o. MANIKAM
Sample3: Who is the SELLER for purchaser Mr. MANIKANDANNNP? Answer: Mr. SHANMUGAM s/o. NAIKUM
Sample4: Who is the SELLER for purchaser Mr. AMMANJI? Answer:
Mr. UMA P s /o ZELLOMART aged


The GPT-2 tokenizer uses byte-pair encoding (BPE), which splits text into subword units, so a single word may be represented by several tokens.

For example, if you set `max_length` to 50, the generated response is limited to 50 tokens in total (the prompt's tokens count toward this limit), which usually corresponds to fewer than 50 words.
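
The effect of `max_length` can be illustrated with a toy greedy decoding loop. The limit applies to the total sequence, so a longer prompt leaves room for fewer new tokens. The "model" below is a hypothetical stand-in that always predicts the previous id plus one; it is not GPT-2.

```python
def greedy_generate(prompt_ids, next_token_fn, max_length, eos_token_id=None):
    # Append one token at a time until the total length reaches max_length
    # or the end-of-sequence token is produced.
    ids = list(prompt_ids)
    while len(ids) < max_length:
        next_id = next_token_fn(ids)
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids

def toy_next_token(ids):
    # Hypothetical "model": always predicts the previous id plus one.
    return ids[-1] + 1

prompt_ids = [10, 11, 12]
output_ids = greedy_generate(prompt_ids, toy_next_token, max_length=8)
print(output_ids)                         # [10, 11, 12, 13, 14, 15, 16, 17]
print(len(output_ids) - len(prompt_ids))  # 5 newly generated tokens
```

With a 3-token prompt and `max_length=8`, only 5 new tokens are generated, which is why the few-shot prompt above leaves little room for the answer at `max_length=150`.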

### Please answer the questions below to complete the experiment:




In [None]:
#@title The architecture of GPT is very similar to: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["", "the encoder-only transformer", "the decoder-only transformer", "the encoder-decoder transformer", "none of the above"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")