# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


## Learning Objectives

At the end of the experiment, you will be able to:

* load and pre-process data from a text file
* load and use a pre-trained tokenizer
* fine-tune a GPT-2 language model from Hugging Face's `transformers` library

## Dataset Description

The text data file is taken from the Project Gutenberg eBook "***The Buddha's Path of Virtue: A Translation of the Dhammapada***" by F. L. Woodward, available [here](https://www.gutenberg.org/files/35185/35185-h/35185-h.htm).

To know more about Project Gutenberg's eBooks, refer [here](https://www.gutenberg.org/).

### **GPT-2**

When it was released, OpenAI's GPT-2 showed an impressive ability to write coherent and passionate essays, exceeding what the language models of the time could produce. GPT-2 was not a particularly novel architecture: it is very similar to a **decoder-only transformer**. It was, however, a very large transformer-based language model trained on a massive dataset.

Here, we are going to fine-tune the GPT-2 model with the text of Project Gutenberg's eBook - The Buddha's Path of Virtue. After fine-tuning, we can expect the model to respond to prompts related to the subject matter of this book.

To know more about GPT-2, refer [here](http://jalammar.github.io/illustrated-gpt2/).

### Setup Steps:

### Importing required packages

In [1]:
import os
import re
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import warnings
warnings.filterwarnings('ignore')

### Load the data

The question-answer data is hosted as CSV files on the Hugging Face Hub.

Install the `datasets` library and load the train and test splits:

In [2]:
!pip install datasets



In [3]:
from datasets import load_dataset
# Load the train and test splits from the Hugging Face Hub
train_dataset = load_dataset("ssirikon/IIIT_D1_train", split = "train")
test_dataset = load_dataset("ssirikon/IIIT_D1_test", split = "test")


In [4]:
# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add special tokens for question and answer
special_tokens = {'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'pad_token': '<|PAD|>',
                  'additional_special_tokens': ['<|question|>', '<|answer|>']}
tokenizer.add_special_tokens(special_tokens)

def preprocess_function(examples):
  questions = examples['question']
  answers = examples['answer']

  # Concatenate question and answer with special tokens
  text = ['<|BOS|> <|question|> ' + q + ' <|answer|> ' + a + ' <|EOS|>' for q, a in zip(questions, answers)]

  # Tokenize the text; for causal language modeling the labels are a copy of the input ids
  tokenized_output = tokenizer(text, padding="max_length", truncation=True)
  tokenized_output['labels'] = tokenized_output['input_ids'].copy()
  return tokenized_output




In [5]:
# Preprocess the data
def preprocess_function_for_test(examples):
  questions = examples['question']
  answers1 = examples['answer1']
  answers2 = examples['answer2']

  # Concatenate question and answer with special tokens
  # Use str() to ensure all values are strings before concatenation
  inputs1 = ['<|BOS|> <|question|> ' + str(q) + ' <|answer|> ' + str(a) + ' <|EOS|>' for q, a in zip(questions, answers1)]
  inputs2 = ['<|BOS|> <|question|> ' + str(q) + ' <|answer|> ' + str(a) + ' <|EOS|>' for q, a in zip(questions, answers2)]

  # Tokenize the inputs
  tokenized_inputs1 = tokenizer(inputs1, padding="max_length", truncation=True)
  tokenized_inputs2 = tokenizer(inputs2, padding="max_length", truncation=True)

  # Find the start and end positions of the answers
  start_positions = []
  end_positions = []
  tokenized_inputs=tokenized_inputs1
  inputs=inputs1
  for i, input_ids in enumerate(tokenized_inputs['input_ids']):
    # Check if answer is None and handle accordingly
    if examples['answer1'][i] is None:
      start_positions.append(0) # or a suitable default value
      end_positions.append(0) # or a suitable default value
      continue

    answer_start = inputs[i].find('<|answer|>') + len('<|answer|>')
    answer_end = inputs[i].find('<|EOS|>')
    answer_ids = tokenizer.encode(inputs[i][answer_start:answer_end], add_special_tokens=False)

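    # Find the start and end positions of answer_ids in input_ids
    j = 0
    while j < len(input_ids):
      if input_ids[j:j+len(answer_ids)] == answer_ids:
        start_positions.append(j)
        end_positions.append(j+len(answer_ids)-1)
        break
      j += 1
    else:
      # No match (e.g. the answer was truncated): keep the lists aligned with the inputs
      start_positions.append(0)
      end_positions.append(0)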

  # Add the start and end positions to the tokenized inputs
  tokenized_inputs['start_positions'] = start_positions
  tokenized_inputs['end_positions'] = end_positions
  return tokenized_inputs
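
The span search inside the loop above can be sketched on its own as a small helper. This is only an illustration: `find_sublist` is a hypothetical name that is not part of the notebook, and the token ids below are made up.

```python
def find_sublist(haystack, needle):
    """Return (start, end) inclusive indices of the first match, or None."""
    if not needle:
        return None  # an empty answer has no meaningful span
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start, start + n - 1
    return None


# Toy ids (made up): the answer tokens [317, 1332] sit at positions 5 and 6
input_ids = [50257, 50260, 2061, 318, 50261, 317, 1332, 50258]
answer_ids = [317, 1332]

print(find_sublist(input_ids, answer_ids))  # (5, 6)
print(find_sublist(input_ids, [999]))       # None
```

Returning `None` when there is no match makes the "answer not found" case explicit, so the caller can decide which default positions to store.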

In [6]:
# Apply the preprocessing function to your dataset
tokenized_train_datasets = train_dataset.map(preprocess_function, batched=True)
tokenized_test_datasets = test_dataset.map(preprocess_function_for_test, batched=True)


In [7]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer

# Initialize the model with the updated tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

model_output_path = "/content/gpt_model"
# Define training arguments
training_args = TrainingArguments(
    output_dir=model_output_path,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_dir='./logs',
    save_steps=10_000,
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_datasets,
    eval_dataset=tokenized_test_datasets
)

# Train the model
trainer.train()


| Step | Training Loss |
|---|---|
| 500 | 0.333 |


TrainOutput(global_step=987, training_loss=0.2297186769371458, metrics={'train_runtime': 1364.0733, 'train_samples_per_second': 2.894, 'train_steps_per_second': 0.724, 'total_flos': 2063161884672000.0, 'train_loss': 0.2297186769371458, 'epoch': 3.0})

### Pre-processing

- Remove any excess newline characters from the text

### Split the text into training and validation sets

### Load pre-trained tokenizer - GPT2Tokenizer

The GPT2Tokenizer is based on ***Byte-Pair-Encoding***.

Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model.

In BPE, new tokens are added until the desired vocabulary size is reached by learning ***merges***, which are rules to merge two elements of the existing vocabulary together into a new one.

The figure below shows how the vocabulary updates as the BPE algorithm progresses.

<br>
<center>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/Byte-pair-encoding.png" width=450px>
</center>

To know more about Byte-Pair Encoding, refer [here](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt#byte-pair-encoding-tokenization).
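
As a rough illustration of how merges are learned, here is a minimal character-level sketch. It is not the GPT-2 implementation (GPT-2 works on bytes and ships a fixed merges file); the function name `learn_bpe_merges` and the toy word frequencies are assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} dict."""
    # Start with every word split into single characters
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits


# Toy corpus (made up for illustration)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = learn_bpe_merges(word_freqs, num_merges=3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

Each learned merge adds one new token to the vocabulary, so the number of merges controls the final vocabulary size.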

<br>

Some of the parameters required to create a GPT2Tokenizer include:

- ***vocab_file (str):*** path to the vocabulary JSON file, which maps tokens to integer ids

- ***merges_file (str):*** path to the ***merges*** file, which contains the merge rules, one per line. Each merge rule consists of the two entities to merge, separated by a space.



Here, we will instantiate a GPT-2 tokenizer from a predefined tokenizer using `from_pretrained()` method.

It includes a parameter:

- ***pretrained_model_name_or_path:*** It can be a string of a predefined tokenizer hosted inside a model repo on huggingface.co.

    For example: *gpt2, gpt2-medium, gpt2-large, or gpt2-xl*

    This will download the corresponding vocab, merges, and config files.

### Tokenize text data

### Data Collator

Data collators are objects that:

- will form a batch by using a list of dataset elements as input
- may apply some processing (like padding)

One of the data collators, `DataCollatorForLanguageModeling`, can also apply some random data augmentation (like random masking) on the formed batch.

<br>

`DataCollatorForLanguageModeling` is a data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:

- ***tokenizer:*** The tokenizer used for encoding the data.
- ***mlm*** (bool, optional, default=True): Whether or not to use masked language modeling.
    - If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100).
    - Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- ***return_tensors*** (str): The type of Tensor to return. Allowable values are “np”, “pt” and “tf” for numpy array, pytorch tensor, and tensorflow tensor respectively.

To know more about `DataCollatorForLanguageModeling` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
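
To make the `mlm=False` behaviour concrete, the following is a simplified pure-Python sketch of the idea: pad each example to the longest one in the batch and set the labels of padded positions to -100. The function name `collate_causal_lm` is hypothetical, and the real collator returns tensors and works through the tokenizer, so treat this only as an approximation.

```python
def collate_causal_lm(examples, pad_token_id):
    """Pad lists of token ids to a common length and build causal LM labels."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; -100 tells the loss to ignore padding
        batch["labels"].append(ids + [-100] * n_pad)
    return batch


# Toy ids (made up); 0 plays the role of the pad token id here
batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, -100, -100]]
```

Padding only to the longest example in each batch keeps batches short when most examples are much shorter than the model's maximum length.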

### Load pre-trained Model

***GPT2LMHeadModel*** is the GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

This model is a PyTorch `torch.nn.Module` subclass which can be used as a regular PyTorch Module.

Parameters:

- ***config (GPT2Config):*** Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

Here, we will instantiate a pre-trained PyTorch model using the `from_pretrained()` method, which loads the weights associated with the model as well as its configuration.

**Note: Approximate training times for different GPT-2 models on a GPU for this dataset are as follows:**

* **GPT-2: ~20 minutes for 100 epochs**

* **GPT-2 Medium: ~1 hour for 100 epochs**

* **GPT-2 Large: runs out of memory**

### Fine-tune Model

Train a GPT-2 model using the provided training arguments. Save the resulting trained model and tokenizer to a specified output directory.

The `Trainer` class provides an API for feature-complete training in PyTorch for most standard use cases.

Before instantiating your Trainer, create a `TrainingArguments` to access all the points of customization during training.

`TrainingArguments` parameters:

- ***output_dir*** (str): The output directory where the model predictions and checkpoints will be written.
- ***overwrite_output_dir*** (bool, optional, default=False): If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
- ***per_device_train_batch_size*** (int, optional, default=8): The batch size per GPU/TPU/MPS/NPU core/CPU for training.
- ***per_device_eval_batch_size*** (int, optional, default=8): The batch size per GPU/TPU/MPS/NPU core/CPU for evaluation.
- ***save_total_limit*** (int, optional): If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir.

To know more about `TrainingArguments` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/trainer#transformers.TrainingArguments).

To know more about `Trainer` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/trainer#transformers.Trainer).
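
The batch size and epoch count determine how many optimizer steps a run takes. The sketch below works through that arithmetic for the values used in this notebook (1316 training examples, batch size 4, 3 epochs), assuming a single device and no gradient accumulation; `total_training_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


print(total_training_steps(1316, 4, 3))  # 987
```

The result agrees with the `global_step=987` reported in the training output of this notebook.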

### Test Model with user input prompts

##### Now, let us test the model with some prompts


The `generate_response()` function takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model.

In [8]:
# Save the model
trainer.save_model(model_output_path)

# Save the tokenizer
tokenizer.save_pretrained(model_output_path)

('/content/gpt_model/tokenizer_config.json',
 '/content/gpt_model/special_tokens_map.json',
 '/content/gpt_model/vocab.json',
 '/content/gpt_model/merges.txt',
 '/content/gpt_model/added_tokens.json')

In [9]:
def generate_response(model, tokenizer, prompt, max_length=100):

    input_ids = tokenizer.encode(prompt, return_tensors="pt")      # 'pt' for returning pytorch tensor

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    full_response =  tokenizer.decode(output[0], skip_special_tokens=True)

    # Split the response to get the answer
    try:
        # Find the literal marker '|answer|' that ends the prompt
        answer_start_index = full_response.index('|answer|')
        # Extract the answer part (everything after the marker)
        answer = full_response[answer_start_index + len('|answer|'):].strip()
    except ValueError:
        # Handle cases where the marker is not found
        print("Warning: '|answer|' not found in response. Returning the full response.")
        answer = full_response

    return answer


In [10]:
# Load the fine-tuned model and tokenizer

my_model = GPT2LMHeadModel.from_pretrained(model_output_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

In [11]:
# Testing with given prompt 1
q = "Describe a process/pipeline for generating representations from pre trained models?"
prompt = '<|BOS|> <|question|> ' + q + '|answer|'+''  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: The process/pipeline is a simple process that involves iteratively updating the model's weights, updating the model's convolutional and hyperparameter maps, and then iteratively updating the model's weights and convolutional weights to generate representations for later training.  How to generate representations from pre trained models?  The process of


In [12]:
# Testing with given prompt 1
q = "What is the difference between concatenation vs. summation of two tensors?"
prompt = '<|BOS|> <|question|> ' + q + '|answer|'+''  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: The difference between concatenation and summation is in the nature of the tensor. Concatenation is the process of converting two tensors into a single tensor. Summation is the process of converting two tensors into a single tensor.  How can we interpret the difference between concatenation and


In [13]:
# Testing with given prompt 1
q = "Why are derivatives substracted from weights?"
prompt = '<|BOS|> <|question|> ' + q + '|answer|'+''  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: The derivatives of a tensor are extracted from the weights, and the weights are then used to update the derivatives of the tensor.  How are derivatives extracted from weights?|Answer, derivatives are extracted from tensors by subtracting the derivatives from the weights.


In [14]:
# Testing with given prompt 1
q = "How we can effectively convert 2D images to 1D?"
prompt = '<|BOS|> <|question|> ' + q+ '|answer|'+''   # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: By converting 2D images to 1D, we can effectively convert them to 1D by converting their pixels to pixels.  How we can effectively convert 2D images to 1D?|answer|By converting 2 How we can effectively convert 2D images to 1D by converting their pixels to pixels.


In [15]:
# Testing with given prompt 2
q = "What is NLP's current biggest challenge that is being tried to overcome ?"  # Replace with your desired prompt
prompt = '<|BOS|> <|question|> ' + q + '|answer|'+''
response = generate_response(my_model, my_tokenizer, prompt, max_length=150)
print("Generated response:", response)

Generated response: NLP is a popular approach to overcome many of the problems in machine learning. It involves learning from data, applying machine learning techniques, and applying machine learning algorithms to solve complex problems.


In [16]:
# Testing with given prompt 3
q = "Is scaling necessary for SVM?"  # Replace with your desired prompt
prompt = '<|BOS|> <|question|> ' + q + '|answer|'+''
response = generate_response(my_model, my_tokenizer, prompt, max_length=150)
print("Generated response:", response)

Generated response: Yes, scaling is essential for SVM to maximize the number of nodes in the cluster.  The number of hidden layers in a SVM cluster depends on the size of the cluster.


In the case of the GPT-2 tokenizer, the model uses a byte-pair encoding (BPE) algorithm, which tokenizes text into subword units. As a result, one word might be represented by multiple tokens.

For example, if you set `max_length` to 50, the prompt and the generated text together are limited to 50 tokens, which usually corresponds to fewer than 50 words.
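
Because `max_length` in `generate()` counts the prompt tokens as well as the newly generated ones, a long question leaves less room for the answer. The helper below is a hypothetical illustration of that budget, and the prompt lengths are made up. As an alternative, `generate()` also accepts `max_new_tokens`, which limits only the generated part.

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for generation when max_length also counts the prompt."""
    return max(0, max_length - prompt_token_count)


# A made-up prompt of 18 tokens with max_length=100
print(new_token_budget(18, 100))  # 82
# A prompt already longer than max_length leaves no room at all
print(new_token_budget(60, 50))   # 0
```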

In [17]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=dc99cb9543a04c0925a6227770c50dd44b7362b3c565166c7fcbcacd93d0af3b
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [18]:
from rouge_score import rouge_scorer

In [19]:
!pip install pandas
import pandas as pd

def generate_responses_for_dataset(dataset):
    data = []  # List to store data for the table
    for i, example in enumerate(dataset):
        question = example.get('question')
        if question is not None:
            prompt = '<|BOS|> <|question|> ' + question + '|answer|'+''
            predicted_answer = generate_response(my_model, my_tokenizer, prompt)

            # Get the actual answer from the test data
            actual_answer = example.get('answer1')  # Assuming 'answer1' is the key for the actual answer

            # Append data to the list
            data.append([question, actual_answer, predicted_answer])
        else:
            # Skip examples that have no question
            continue

    # Create a pandas DataFrame
    df = pd.DataFrame(data, columns=['Question', 'Actual Answer', 'Predicted Answer'])

    # Write the DataFrame to a CSV file
    df.to_csv('predictions.csv', index=False)  # Save to 'predictions.csv'
    print("Predictions saved to predictions.csv")

    return df  # Return the DataFrame (optional)



In [None]:
test_responses = generate_responses_for_dataset(test_dataset)

In [25]:
#print predictions csv as a table
test_responses = pd.read_csv('predictions.csv')
test_responses.head(120)


| Unnamed: 0 | Question | Actual Answer | Predicted Answer |
|---|---|---|---|
| 0 | How we can effectively convert 2D images to 1D? | Converting images to 1D data may not be effect... | By converting 2D images to 1D, we can effectiv... |
| 1 | Can we utilize an autoencoder to perform dimen... | Yes, autoencoders can be applied to numerical ... | Yes, an autoencoder can be utilized to perform... |
| 2 | What is NLP's current biggest challenge that i... | The main challenges of NLP is finding and coll... | NLP is a popular approach to overcome many of ... |
| 3 | Which problems cannot be solved by Neural netw... | While neural networks have shown great success... | Nanogamy is a popular problem in neural networ... |
| 4 | Is scaling necessary for SVM? | Yes, scaling the input data is generally recom... | Yes, scaling is essential for SVM to maximize ... |
| ... | ... | ... | ... |
| 115 | Can you repeat difference between data mining ... | Data mining refers to the process of discoveri... | for example, do you have any examples of machi... |
| 116 | Is there any software available for clinical l... | CLAMP (Clinical Language Annotation, Modeling,... | Yes, there is a comprehensive database of medi... |
| 117 | When do we slice? | Slicing is a useful technique in Python for ex... | The slicing algorithm is used to extract the e... |
| 118 | In terms of obtaining better context, is lemma... | Yes, lemmatization is generally considered bet... | The lemmatization process is not necessarily s... |


In [24]:
!pip install nltk
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score

# Download the necessary NLTK resources before using them
nltk.download('punkt')  # Download the 'punkt' resource for tokenization
nltk.download('wordnet') # Required for METEOR

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True) # Include rougeLsum
scores = []

for i in range(len(test_responses)):
    if i < len(test_dataset):
        reference = test_dataset[i]['answer1']
        # Access the 'Predicted Answer' column using .iloc[i] to get the prediction at index i
        prediction = test_responses['Predicted Answer'].iloc[i]

        if reference is not None and prediction is not None:
            # Calculate ROUGE scores
            rouge_scores = scorer.score(reference, prediction)

            # Calculate BLEU score
            reference_tokens = nltk.word_tokenize(reference)
            prediction_tokens = nltk.word_tokenize(prediction)
            bleu_score = sentence_bleu([reference_tokens], prediction_tokens)

            # Calculate METEOR score
            # Tokenize the prediction before passing it to meteor_score
            prediction_tokens = nltk.word_tokenize(prediction)
            meteor = meteor_score([nltk.word_tokenize(reference)], prediction_tokens)  # Tokenize reference here

            # Store all scores in a dictionary
            all_scores = {
                'rouge1': rouge_scores['rouge1'].fmeasure,
                'rouge2': rouge_scores['rouge2'].fmeasure,
                'rougeL': rouge_scores['rougeL'].fmeasure,
                'rougeLsum': rouge_scores['rougeLsum'].fmeasure,  # Added rougeLsum
                'bleu': bleu_score,
                'meteor': meteor
            }
            scores.append(all_scores)
        else:
            print(f"Warning: Skipping index {i} due to missing reference or prediction.")
    else:
        print(f"Warning: Skipping index {i} as it's out of range for test_dataset")
        break

if scores:
    avg_scores = {
        metric: sum([s[metric] for s in scores]) / len(scores)
        for metric in ['rouge1', 'rouge2', 'rougeL', 'rougeLsum', 'bleu', 'meteor']  # Include all metrics
    }
    print(avg_scores)
else:
    print("Warning: No scores calculated due to missing references or predictions.")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


{'rouge1': 0.27960045434462805, 'rouge2': 0.10651364371318459, 'rougeL': 0.2298777717936165, 'rougeLsum': 0.2326307595466042, 'bleu': 0.0370211055402379, 'meteor': 0.2699725994911716}
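
To give some intuition for the `rouge1` value above, here is a simplified sketch of a unigram-overlap F-measure. It splits on whitespace and does no stemming, so it will not reproduce the numbers from `rouge_score` exactly; the function name `rouge1_f` and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def rouge1_f(reference, prediction):
    """Unigram-overlap F-measure between a reference and a prediction."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as in both texts
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of six reference words and two of three predicted words overlap
score = rouge1_f("scaling the input data is recommended", "scaling is essential")
print(round(score, 4))  # 0.4444
```

A short prediction can score high precision yet low recall, which is why the F-measure combines both.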
