Step 1: Data preprocessing

Load the edited_data.csv file containing all the chapters and verses including explanations
Preprocess the data by cleaning, tokenizing and removing stopwords

Step 2: Model training

Load the training_data.json file containing the set of questions and answers
Fine-tune a pre-trained language model (GPT-2 in this notebook) on the dataset using transfer learning
Evaluate the performance of the model using a validation set

Step 3: Model testing

Ask the user to input a question
Use the fine-tuned model to generate an answer based on the input question

In [1]:
# Import necessary libraries
import pandas as pd
import json
import numpy as np
import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
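from torch.optim import AdamW  # the AdamW bundled with transformers is deprecated; PyTorch's is the maintained one
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup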
import torch


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Load and preprocess the data
df = pd.read_csv('edited_data.csv')
df.head()

Unnamed: 0,Chapter No,Verse No,Shloka,English Translation,Explanation
0,1,1,धृतराष्ट्र उवाच | धर्मक्षेत्रे कुरुक्षेत्रे सम...,"Dhritarastra said: O Sanjaya, what did my sons...",The two armies had gathered on the battlefield...
1,1,2,सञ्जय उवाच । दृष्ट्वा तु पाण्डवानीकं व्यूढं दु...,"Sanjaya said: But then, seeing the army of the...","Sanjay understood Dhritarashtra’s concern, who..."
2,1,3,पश्यैतां पाण्डुपुत्राणामाचार्य महतीं चमूम् । व...,"O teacher, (please) see this vast army of the ...",Duryodhana asked Dronacharya to look at the sk...
3,1,4,अत्र शूरा महेष्वासा भीमार्जुनसमा युधि | युयुधा...,"There are in this army, heroes wielding great ...","Due to his anxiety, the Pandava army seemed mu..."
4,1,5,धृष्टकेतुश्चेकितान: काशिराजश्च वीर्यवान् | पुर...,"Dhrstaketu, Cekitana, and the valiant king of ...","Due to his anxiety, the Pandava army seemed mu..."


In [3]:

# Keep the verse reference and explanation, merged into a single text column
stop_words = set(stopwords.words('english'))  # build once; calling stopwords.words() per word is very slow
df = df[['Chapter No', 'Verse No', 'Explanation']].copy()  # copy avoids SettingWithCopyWarning
df['text'] = df['Chapter No'].astype(str) + ' ' + df['Verse No'].astype(str) + ' ' + df['Explanation'].astype(str)
df.drop(['Chapter No', 'Verse No', 'Explanation'], axis=1, inplace=True)
df['text'] = df['text'].apply(lambda x: x.lower())
df['text'] = df['text'].apply(lambda x: re.sub(r'\d+', '', x))  # note: this also strips the chapter and verse numbers
df['text'] = df['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
texts = df['text'].tolist()

In [4]:
# Load the questions and answers from the json file
with open('training_data.json') as file:
    data = json.load(file)

In [5]:
# Combine the prompts and completions to form the training data
prompts = []
completions = []
for item in data:
    prompts.append(item['prompt'])
    completions.append(item['completion'])

In [7]:
# Define the model and optimizer
model = GPT2LMHeadModel.from_pretrained('gpt2')  # pass the checkpoint name directly so the cell can be re-run
optimizer = AdamW(model.parameters(), lr=1e-5)





The next cell sets up the learning rate schedule for training the GPT-2 model. Here's what each line does:

1. batch_size = 32: Sets the batch size, the number of training examples processed in each iteration of the training loop.

2. num_epochs = 10: Sets the number of training epochs. An epoch is one complete pass through the training dataset, so the model makes 10 passes.

3. num_train_steps = len(training_inputs) // batch_size * num_epochs: Calculates the total number of training steps, where each step processes one batch. len(training_inputs) is the number of training examples, the // operator (floor division) gives the number of full batches per epoch, and multiplying by num_epochs gives the total. Because floor division discards a final partial batch while the training loop still processes it, this undercounts by one step per epoch whenever the dataset size is not a multiple of batch_size.

4. scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_train_steps): Creates the schedule with get_linear_schedule_with_warmup from the transformers library. The learning rate rises linearly from 0 to the optimizer's initial value over num_warmup_steps, then decays linearly to 0 by num_training_steps. The optimizer argument is the optimizer whose learning rate is adjusted (here AdamW). With num_warmup_steps=0 there is no warmup, so the rate starts at its initial value and only decays.
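
The schedule in item 4 can be sketched in plain Python. This is a minimal illustration, not the library's code: linear_schedule_multiplier is a hypothetical helper that is assumed to mirror the factor get_linear_schedule_with_warmup applies to the base learning rate, and the dataset size of 100 is invented for the example. The step count uses ceiling division, which departs from the floor division in item 3 so that the final partial batch is counted.

```python
import math

def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Decay: ramp linearly from 1 down to 0, never going negative
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical sizes, chosen only for illustration
num_examples, batch_size, num_epochs = 100, 32, 10
steps_per_epoch = math.ceil(num_examples / batch_size)  # counts the final partial batch
num_train_steps = steps_per_epoch * num_epochs

base_lr = 1e-5
for step in (0, num_train_steps // 2, num_train_steps):
    lr = base_lr * linear_schedule_multiplier(step, 0, num_train_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With no warmup the multiplier starts at 1 and falls to 0 at the last step, so the step count passed to the scheduler should match the number of times scheduler.step() is actually called.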




In [10]:
# Set up the learning rate schedule
batch_size = 32
num_epochs = 10
num_train_steps = len(training_inputs) // batch_size * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_train_steps)


In [None]:
# Train the model
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model.train()
for epoch in range(num_epochs):
    for i in range(0, len(training_inputs), batch_size):
        # A causal LM learns from one sequence: the prompt followed by its completion
        batch_texts = [p + ' ' + c + tokenizer.eos_token
                       for p, c in zip(training_inputs[i:i+batch_size], training_outputs[i:i+batch_size])]
        inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=max_length, return_tensors='pt')

        # Labels are the input ids themselves; -100 excludes padding positions from the loss
        labels = inputs['input_ids'].masked_fill(inputs['attention_mask'] == 0, -100)
        loss = model(inputs['input_ids'], attention_mask=inputs['attention_mask'], labels=labels).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

The commented-out pipeline cell below uses the Hugging Face Transformers library to generate text completions for a list of input texts with the GPT-2 model. Here is a breakdown of what each section of the code does:

1. Load the GPT-2 tokenizer with the GPT2Tokenizer class and a text-generation model with the pipeline function from the Transformers library.
2. Set the maximum length of the generated text using the max_length variable.
3. Generate a completion for each input text in the texts list: encode the text with the tokenizer (truncating it to max_length), decode it back to a string, and pass that string to the pipeline.
4. Extract the input_ids from each completion and store them in the completion_ids list. Note that the text-generation pipeline returns dictionaries keyed by generated_text by default, not input_ids, so this lookup would fail as written.

In summary, the code takes a list of input texts, generates a completion for each with GPT-2, and tries to collect token ids from the results. The intent is for completion_ids to feed downstream steps such as fine-tuning the model.

In [12]:
#from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
#tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set the maximum length of the input sequence
#max_length = 1024

# Tokenize the text and truncate to max_length
#input_ids = []
#for text in texts:
    #encoded = tokenizer.encode(text, truncation=True, max_length=max_length)
    #input_ids.append(encoded)


In [None]:
#from transformers import pipeline, GPT2Tokenizer

# Load the GPT-2 model and tokenizer
#model = "gpt2"
#tokenizer = GPT2Tokenizer.from_pretrained(model)
#gpt2 = pipeline("text-generation", model=model)

# Set the maximum length of the generated text
#max_length = 1024

# Generate text completions for each input text
#completions = []
#for text in texts:
    #encoded = tokenizer.encode(text, truncation=True, max_length=max_length)
    #completion = gpt2(tokenizer.decode(encoded), max_length=max_length)[0]
    #completions.append(completion)

# Extract the input_ids from the completions
#completion_ids = [c["input_ids"] for c in completions]



In [6]:
from transformers import pipeline, GPT2Tokenizer
import random

# Load the GPT-2 model and tokenizer
model = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model)
gpt2 = pipeline("text-generation", model=model)

# Set the maximum length of the generated text
max_length = 1024

# Preprocess the training data
training_inputs = []
training_outputs = []
for prompt, completion in zip(prompts, completions):
    # Tokenize the prompt and completion and truncate to max_length
    encoded_prompt = tokenizer.encode(prompt, truncation=True, max_length=max_length)
    encoded_completion = tokenizer.encode(completion, truncation=True, max_length=max_length)

    # Combine the prompt and completion into a single sequence.
    # GPT2Tokenizer.encode adds no leading special token, so nothing needs stripping.
    input_ids = (encoded_prompt + encoded_completion)[:max_length]

    # Store the raw text pair; the training loop tokenizes it batch by batch
    training_inputs.append(prompt)
    training_outputs.append(completion)


In [12]:
# Split the data into training and validation sets
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, completion_ids, random_state=42, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(input_ids, attention_masks, random_state=42, test_size=0.1)


NameError: name 'completion_ids' is not defined

In [11]:
# Tokenize the prompts and completions
max_length = 1024
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenized_prompts = tokenizer(prompts, truncation=True, max_length=max_length, padding=True)
tokenized_completions = tokenizer(completions, truncation=True, max_length=max_length, padding=True)

# Split the data into training and validation sets
train_inputs, val_inputs, train_labels, val_labels = train_test_split(tokenized_prompts['input_ids'], tokenized_completions['input_ids'], test_size=0.1, random_state=42)
train_masks, val_masks, _, _ = train_test_split(tokenized_prompts['attention_mask'], tokenized_completions['attention_mask'], test_size=0.1, random_state=42)


Using pad_token, but it is not set yet.


ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

The above had a time issue. 
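
The ValueError above arises because GPT-2 ships without a padding token, so padding=True has nothing to pad with. The remedy the message itself suggests is tokenizer.pad_token = tokenizer.eos_token. The plain-Python sketch below illustrates what padding then does to a batch; pad_batch is a hypothetical helper and the token ids are invented, with only the end-of-text id 50256 taken from the GPT-2 vocabulary.

```python
EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary

def pad_batch(sequences, pad_id=EOS_ID):
    """Right-pad token-id lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask

# Hypothetical token ids, for illustration only
batch = [[11, 22, 33], [44]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Because the pad id doubles as the end-of-text id, the attention mask is what tells the model which positions are real, and padded positions are usually excluded from the loss as well.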

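The next cell uses the Hugging Face Transformers library to tokenize the texts list with the RoBERTa tokenizer. RoBERTa assigns token ids differently from GPT-2, so ids produced this way do not line up with the GPT-2 model's embedding table. Here's a breakdown of what each section of the code does: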

1. Load the Roberta tokenizer using the RobertaTokenizer class from the Transformers library.

2. Set the maximum length of the input sequence using the max_length variable.

3. Tokenize each text in the texts list using the tokenizer by encoding the text and truncating it to max_length.

4. Append the encoded text to the input_ids list.

In [8]:
from transformers import RobertaTokenizer

# Load the Roberta tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Set the maximum length of the input sequence
max_length = 512

# Tokenize the text and truncate to max_length
input_ids = []
for text in texts:
    encoded = tokenizer.encode(text, truncation=True, max_length=max_length)
    input_ids.append(encoded)


In [9]:
# Split the data into training and validation sets
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, completion_ids, random_state=42, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(input_ids, attention_masks, random_state=42, test_size=0.1)


NameError: name 'completion_ids' is not defined

In [None]:


# Split the data into training and validation sets
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, completion_ids['input_ids'], random_state=42, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(input_ids, completion_ids['attention_mask'], random_state=42, test_size=0.1)

# Convert the data to PyTorch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)


In [None]:
#Convert the data to PyTorch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

#Create a DataLoader for the training and validation sets
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 8

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

#Load the pre-trained GPT-2 model and fine-tune it on the training data
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

epochs = 3
learning_rate = 5e-5
warmup_steps = 1000
epsilon = 1e-8

optimizer = AdamW(model.parameters(), lr=learning_rate, eps=epsilon)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)

for epoch in range(epochs):
    print('Training epoch {}'.format(epoch+1))
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        b_inputs = batch[0].to(device)
        b_masks = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        outputs = model(b_inputs, attention_mask=b_masks, labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        if (step+1) % 50 == 0:
            # Running mean over the batches seen so far in this epoch
            print('Epoch: {}, Batch: {}, Loss: {}'.format(epoch+1, step+1, total_loss/(step+1)))

In [None]:
# Evaluate the model on the validation set
model.eval()
total_eval_loss = 0
for batch in validation_dataloader:
    b_inputs = batch[0].to(device)
    b_masks = batch[1].to(device)
    b_labels = batch[2].to(device)
    with torch.no_grad():
        outputs = model(b_inputs, attention_mask=b_masks, labels=b_labels)
    loss = outputs[0]
    total_eval_loss += loss.item()

average_train_loss = total_loss / len(train_dataloader)
average_eval_loss = total_eval_loss / len(validation_dataloader)

print('  Average training loss: {0:.2f}'.format(average_train_loss))
print('  Average validation loss: {0:.2f}'.format(average_eval_loss))

# Ask the user to input a question
while True:
    prompt = input("Ask a question about the Gita (blank line to quit): ")
    if not prompt.strip():
        break  # leave the loop instead of prompting forever

    # Tokenize the prompt and generate a response
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True, return_tensors='pt')
    prompt_tokens = prompt_tokens.to(device)

    generated = model.generate(
        prompt_tokens,
        max_length=1000,
        temperature=0.7,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
    )

    response = tokenizer.decode(generated[0], skip_special_tokens=True)

    # Print the response
    print(response)
