Import necessary libraries: This code imports the libraries needed for the script, including pandas, numpy, re, string, nltk, sklearn, transformers, and torch.

Define the model and tokenizer: This code defines the GPT-2 language model and tokenizer from the transformers library.

Load and preprocess the data: This code loads and preprocesses the training data from a CSV file. The data is first filtered to keep only the Chapter No, Verse No, and Explanation columns. The text column is then created by combining these three columns and cleaned by converting all text to lowercase, removing digits, removing punctuation, and removing stop words.

Load the questions and answers from the JSON file: This code loads the prompts and completions from a JSON file.

Combine the prompts and completions to form the training data: This code combines the prompts and completions to form the input-output pairs for training the language model.

Tokenize the inputs and outputs: This code tokenizes the input-output pairs using the GPT-2 tokenizer, pads them to the same length, and converts them to PyTorch tensors.

Set up the learning rate schedule: This code sets up the learning rate schedule for training the language model using the AdamW optimizer and a linear schedule with warmup.

Set up the device: This code sets up the device to use GPU if available, otherwise CPU.





In [1]:
# Import necessary libraries
import pandas as pd
import json
import numpy as np
import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AdamW, get_linear_schedule_with_warmup
import torch

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Define the model and tokenizer
model_name = 'gpt2-medium'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name)

In [3]:
# Load and preprocess the data
df = pd.read_csv('edited_data.csv')
df = df[['Chapter No', 'Verse No', 'Explanation']]
df['text'] = df['Chapter No'].astype(str) + ' ' + df['Verse No'].astype(str) + ' ' + df['Explanation'].astype(str)
df.drop(['Chapter No', 'Verse No', 'Explanation'], axis=1, inplace=True)
df['text'] = df['text'].apply(lambda x: x.lower())
df['text'] = df['text'].apply(lambda x: re.sub(r'\d+', '', x))
df['text'] = df['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords.words('english')]))
texts = df['text'].tolist()

In [4]:
# Load the questions and answers from the json file
with open('training_data.json') as file:
    data = json.load(file)

# Combine the prompts and completions to form the training data
inputs = []
outputs = []
for item in data:
    inputs.append(item['prompt'])
    outputs.append(item['completion'])

In [5]:
# Tokenize the inputs and outputs
max_length = 512
inputs = tokenizer(inputs, padding=True, truncation=True, max_length=max_length, return_tensors='pt')
outputs = tokenizer(outputs, padding=True, truncation=True, max_length=max_length, return_tensors='pt')


In [6]:
# Set up the learning rate schedule
batch_size = 16  # reduce batch size
num_epochs = 10
num_train_steps = len(inputs['input_ids']) // batch_size * num_epochs
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_train_steps)




In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In this revised version, we first calculate the number of batches based on the input size and batch size, using integer division and adding 1 for any remaining samples in the last batch. We then use start_idx and end_idx to index into the input and output tensors for each batch. This avoids the need to add 1 to i+batch_size in the original code, which could cause an index out of range error if the last batch had fewer samples than the batch size. Additionally, we calculate the end index using min() to avoid indexing past the end of the input tensor. Finally, we use more descriptive variable names for clarity.

In [8]:
#batch_size = 32
#num_batches = (len(inputs['input_ids']) - 1) // batch_size + 1  # calculate number of batches

#for i in range(num_batches):
    #start_idx = i * batch_size
    #end_idx = min(start_idx + batch_size, len(inputs['input_ids']))  # last batch may have fewer samples
    #batch_inputs = inputs['input_ids'][start_idx:end_idx].to(device)
    #batch_outputs = inputs['input_ids'][start_idx+1:end_idx+1].to(device)
    #outputs = model(batch_inputs, attention_mask=inputs['attention_mask'][start_idx:end_idx], 
                    #labels=batch_outputs, output_attentions=True, output_hidden_states=True)
    #loss = outputs[0]
    #loss.backward()


In [9]:
batch_size = 32

for i in range(0, len(inputs['input_ids']) - batch_size + 1, batch_size):
    batch_inputs = inputs['input_ids'][i:i+batch_size].to(device)
    batch_outputs = inputs['input_ids'][i+1:i+batch_size+1].to(device)
    outputs = model(batch_inputs, attention_mask=inputs['attention_mask'][i:i+batch_size], labels=batch_outputs, output_attentions=True, output_hidden_states=True)
    loss = outputs[0]
    loss.backward()



In [None]:
# Train the model
for epoch in range(num_epochs):
    epoch_loss = 0
    for i in range(0, len(inputs['input_ids']), batch_size):
        batch_inputs = inputs['input_ids'][i:i+batch_size]
        batch_outputs = outputs['input_ids'][i:i+batch_size]
        
        # Adjust batch size of target tensor
        if len(batch_outputs) != len(batch_inputs):
            batch_outputs = batch_outputs[:len(batch_inputs)]

        outputs = model(batch_inputs, attention_mask=inputs['attention_mask'][i:i+batch_size], labels=batch_outputs, output_attentions=True, output_hidden_states=True)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        
        epoch_loss += loss.item()
        
    print('Epoch:', epoch+1, 'Loss:', epoch_loss)


KeyError: 'input_ids'